uniqueness_data_check#

Data check that checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Module Contents#

Classes Summary#

UniquenessDataCheck

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Attributes Summary#

`warning_not_unique_enough`
`warning_too_unique`

Contents#

class evalml.data_checks.uniqueness_data_check.UniquenessDataCheck(problem_type, threshold=0.5)[source]#

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Parameters

problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.
threshold (float) – The uniqueness score threshold, used as an upper bound for classification problems or as a lower bound for regression problems. Defaults to 0.50.

Methods

`name`	Return a name describing the data check.
`uniqueness_score`	Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
`validate`	Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

name(cls)#

Return a name describing the data check.

static uniqueness_score(col, drop_na=True)[source]#

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

Based on the Herfindahl-Hirschman Index.

Parameters

col (pd.Series) – Feature values.
drop_na (bool) – Whether to drop null values when computing the uniqueness score. Defaults to True.

Returns

Uniqueness score.

Return type

(float)
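
As a rough illustration of a Herfindahl-Hirschman-style score, the sketch below computes one minus the sum of squared value shares. The function name `uniqueness_score_sketch` is hypothetical, and the formula is an assumption chosen to be consistent with the values in the examples on this page; it is not taken from the library source.

```python
import pandas as pd


def uniqueness_score_sketch(col: pd.Series, drop_na: bool = True) -> float:
    """Hypothetical HHI-style uniqueness score (assumed formula, not the library code)."""
    # Count occurrences of each distinct value; optionally ignore nulls.
    counts = col.value_counts(dropna=drop_na)
    # Convert counts to shares that sum to 1.
    shares = counts / counts.sum()
    # The HHI is the sum of squared shares; one minus it grows with uniqueness.
    return 1.0 - float((shares ** 2).sum())


y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
print(uniqueness_score_sketch(y))
```

Under this assumed formula, a constant column scores 0.0, and a column whose values are all distinct approaches 1.0 as the number of rows grows.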

validate(self, X, y=None)[source]#

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A dict with a DataCheckWarning if any columns are too unique or not unique enough.

Return type

dict
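
The threshold rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `flag_columns` is hypothetical, the score is assumed to be one minus the sum of squared value shares, any problem type without "regression" in its name is treated as classification, and the strict comparisons are assumptions.

```python
import pandas as pd


def flag_columns(X: pd.DataFrame, problem_type: str, threshold: float = 0.5) -> list:
    """Hypothetical helper: list the columns a uniqueness check might flag."""
    scores = {}
    for name in X.columns:
        # Shares of each distinct value (nulls are dropped by default).
        shares = X[name].value_counts(normalize=True)
        scores[name] = 1.0 - float((shares ** 2).sum())
    if "regression" in problem_type:
        # Assumption: regression flags columns scoring below the threshold.
        return [c for c, s in scores.items() if s < threshold]
    # Assumption: classification flags columns scoring above the threshold.
    return [c for c, s in scores.items() if s > threshold]


df = pd.DataFrame({
    "regression_unique_enough": [float(x) for x in range(100)],
    "regression_not_unique_enough": [1.0 for _ in range(100)],
})
print(flag_columns(df, "regression", threshold=0.8))
print(flag_columns(df, "multiclass", threshold=0.8))
```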

Examples

>>> import pandas as pd
>>> from evalml.data_checks.uniqueness_data_check import UniquenessDataCheck

Because the problem type is regression, the column “regression_not_unique_enough” raises a warning for having just one value.

>>> df = pd.DataFrame({
...    "regression_unique_enough": [float(x) for x in range(100)],
...    "regression_not_unique_enough": [float(1) for x in range(100)]
... })
...
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == [
...     {
...         "message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
...         "data_check_name": "UniquenessDataCheck",
...         "level": "warning",
...         "code": "NOT_UNIQUE_ENOUGH",
...         "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "parameters": {},
...                 "data_check_name": "UniquenessDataCheck",
...                 "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}
...             }
...         ]
...     }
... ]

For multiclass, the column “regression_unique_enough” has too many unique values and will raise an appropriate warning.

>>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
>>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8)
>>> assert uniqueness_check.validate(df) == [
...     {
...         "message": "Input columns 'regression_unique_enough' for multiclass problem type are too unique.",
...         "data_check_name": "UniquenessDataCheck",
...         "level": "warning",
...         "details": {
...             "columns": ["regression_unique_enough"],
...             "rows": None,
...             "uniqueness_score": {"regression_unique_enough": 0.99}
...         },
...         "code": "TOO_UNIQUE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "UniquenessDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["regression_unique_enough"], "rows": None}
...             }
...         ]
...     }
... ]
...
>>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625

evalml.data_checks.uniqueness_data_check.warning_not_unique_enough = Input columns {} for {} problem type are not unique enough.#

evalml.data_checks.uniqueness_data_check.warning_too_unique = Input columns {} for {} problem type are too unique.#