uniqueness_data_check

Module Contents

Classes Summary

UniquenessDataCheck

Checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Attributes Summary

`warning_not_unique_enough`
`warning_too_unique`

Contents

class evalml.data_checks.uniqueness_data_check.UniquenessDataCheck(problem_type, threshold=0.5)

Checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Parameters

problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, or ‘time series regression’.
threshold (float) – The threshold used as an upper bound on uniqueness for classification problems or as a lower bound on uniqueness for regression problems. Defaults to 0.5.

Methods

`name`	Returns a name describing the data check.
`uniqueness_score`	Calculates a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
`validate`	Checks if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

name(cls): Returns a name describing the data check.

static uniqueness_score(col)

Calculates a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

Based on the Herfindahl–Hirschman Index.

Parameters: col (pd.Series) – Feature values.
Returns: Uniqueness score.
Return type: float
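To make the calculation concrete, here is a minimal pure-Python sketch. It assumes the score is one minus the Herfindahl–Hirschman Index of the value frequencies, which this page does not state explicitly. The name `uniqueness_score_sketch` is hypothetical and this is not evalml's implementation.

```python
import math
from collections import Counter


def uniqueness_score_sketch(values):
    """Return 1 - sum(p_i ** 2) over the non-missing values (assumed formula)."""
    # Drop None and float NaN so missing entries never count as a unique value.
    kept = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not kept:
        # Assumption: the score is undefined when no values remain.
        return float("nan")
    counts = Counter(kept)
    total = len(kept)
    # Herfindahl-Hirschman Index: sum of squared frequency shares.
    hhi = sum((n / total) ** 2 for n in counts.values())
    return 1.0 - hhi


constant_score = uniqueness_score_sketch([1.0] * 100)
distinct_score = uniqueness_score_sketch([float(x) for x in range(100)])
```

Under this assumption a constant column scores 0.0, which agrees with the `uniqueness_score` of 0.0 reported for `regression_not_unique_enough` in the `validate` example, while a column of 100 distinct values scores close to 0.99.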

validate(self, X, y=None)

Checks if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckWarning if any columns are too unique or not unique enough.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import UniquenessDataCheck
>>> df = pd.DataFrame({
...    'regression_unique_enough': [float(x) for x in range(100)],
...    'regression_not_unique_enough': [float(1) for x in range(100)]
... })
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (regression_not_unique_enough) for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck", "level": "warning", "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"column": "regression_not_unique_enough", 'uniqueness_score': 0.0}}],
...     "actions": [{"code": "DROP_COL", "metadata": {"column": "regression_not_unique_enough"}}]}
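The role of `threshold` as a lower bound for regression and an upper bound for classification can be sketched in plain Python. The helper `flag_columns` is hypothetical, the strict inequalities are an assumption, and matching on the substring "regression" is a simplification of however the library actually resolves problem types.

```python
def flag_columns(scores, problem_type, threshold=0.5):
    """Hypothetical helper: return the column names that would be flagged.

    scores maps column name -> uniqueness score in [0, 1].
    """
    if "regression" in problem_type:
        # Regression: threshold is a lower bound, so low scores are flagged
        # as not unique enough (strict comparison is an assumption).
        return [name for name, score in scores.items() if score < threshold]
    # Classification: threshold is an upper bound, so high scores are flagged
    # as too unique (strict comparison is an assumption).
    return [name for name, score in scores.items() if score > threshold]


scores = {"regression_unique_enough": 0.99, "regression_not_unique_enough": 0.0}
flagged_for_regression = flag_columns(scores, "regression", threshold=0.8)
flagged_for_multiclass = flag_columns(scores, "multiclass", threshold=0.8)
```

With a threshold of 0.8, only the constant column is flagged for a regression problem, while the same scores under a classification problem would flag only the highly unique column.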

evalml.data_checks.uniqueness_data_check.warning_not_unique_enough = 'Input columns ({}) for {} problem type are not unique enough.'

evalml.data_checks.uniqueness_data_check.warning_too_unique = 'Input columns ({}) for {} problem type are too unique.'
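Both attributes are format templates with two `{}` placeholders. The sketch below copies the template strings from this page and fills them with `str.format`. The helper `build_message` is hypothetical, and passing one column name per message is an assumption based on the singular `column` key in the `validate` example.

```python
# Template strings copied from the module attributes documented on this page.
warning_not_unique_enough = "Input columns ({}) for {} problem type are not unique enough."
warning_too_unique = "Input columns ({}) for {} problem type are too unique."


def build_message(template, column, problem_type):
    """Hypothetical helper: fill a warning template for one column."""
    # The first placeholder takes the column name, the second the problem type.
    return template.format(column, problem_type)


message = build_message(
    warning_not_unique_enough, "regression_not_unique_enough", "regression"
)
print(message)
```

The resulting string matches the `message` field shown in the `validate` example.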
