uniqueness_data_check

Data check that checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Module Contents

Classes Summary

UniquenessDataCheck

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Attributes Summary

`warning_not_unique_enough`
`warning_too_unique`

Contents

class evalml.data_checks.uniqueness_data_check.UniquenessDataCheck(problem_type, threshold=0.5)

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Parameters

problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, or ‘time series regression’.
threshold (float) – The upper bound on uniqueness for classification problems, or the lower bound on uniqueness for regression problems. Defaults to 0.50.
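
The way `threshold` switches direction with the problem type can be sketched as follows. This is a minimal illustration rather than the library's code: `flag_columns` is a hypothetical helper, the strict inequalities are an assumption, and treating every non-regression problem type as classification is a simplification.

```python
def flag_columns(scores, problem_type, threshold=0.5):
    """Return the columns whose uniqueness score is on the wrong side of the threshold.

    Hypothetical helper for illustration only.
    `scores` maps a column name to its uniqueness score in [0, 1].
    """
    if "regression" in problem_type:
        # Regression: the threshold is a lower bound, so low scores are flagged.
        return [name for name, score in scores.items() if score < threshold]
    # Assumption: everything else is treated as classification, where the
    # threshold is an upper bound, so high scores are flagged.
    return [name for name, score in scores.items() if score > threshold]


scores = {"regression_unique_enough": 0.99, "regression_not_unique_enough": 0.0}
regression_flags = flag_columns(scores, "regression", threshold=0.8)
multiclass_flags = flag_columns(scores, "multiclass", threshold=0.8)
```

With the scores reported in the Examples section (0.99 and 0.0) and a threshold of 0.8, the regression call flags only the constant column and the multiclass call flags only the all-distinct column.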

Methods

`name`	Return a name describing the data check.
`uniqueness_score`	Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
`validate`	Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

name(cls)

Return a name describing the data check.

static uniqueness_score(col, drop_na=True)

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

Based on the Herfindahl–Hirschman Index.

Parameters

col (pd.Series) – Feature values.
drop_na (bool) – Whether to drop null values when computing the uniqueness score. Defaults to True.

Returns

Uniqueness score.

Return type

float
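
The values shown in the Examples section (0.65625, 0.0 and 0.99) are consistent with a score of one minus the sum of squared value frequencies, which is the complement of the Herfindahl–Hirschman Index. The sketch below reproduces that arithmetic with the standard library only. `hhi_uniqueness_score` is a hypothetical name, the handling of missing values and of empty input is an assumption, and the library's own implementation may differ in detail.

```python
import math
from collections import Counter


def hhi_uniqueness_score(values, drop_na=True):
    """Sketch of a uniqueness score: 1 - sum(p_i ** 2) over value frequencies p_i."""

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    # Map every missing value to None so that, when kept, they form one category.
    items = [None if is_missing(v) else v for v in values]
    if drop_na:
        items = [v for v in items if v is not None]

    total = len(items)
    if total == 0:
        # Assumption of this sketch: empty input has no concentration at all.
        return 1.0

    counts = Counter(items)
    concentration = sum((count / total) ** 2 for count in counts.values())
    return 1.0 - concentration


score = hhi_uniqueness_score([1, 1, 1, 2, 2, 3, 3, 3])
```

For `[1, 1, 1, 2, 2, 3, 3, 3]` the frequencies are 3/8, 2/8 and 3/8, so the score is 1 - (9 + 4 + 9)/64 = 0.65625. A constant column scores 0.0, and a column of 100 distinct values scores 1 - 100 × 0.01² = 0.99.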

validate(self, X, y=None)

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A dictionary containing a DataCheckWarning if any columns are too unique or not unique enough.

Return type

dict

Examples

>>> import pandas as pd
>>> from evalml.data_checks import UniquenessDataCheck

Because the problem type is regression, the column “regression_not_unique_enough” raises a warning for having just one value.

>>> df = pd.DataFrame({
...    'regression_unique_enough': [float(x) for x in range(100)],
...    'regression_not_unique_enough': [float(1) for x in range(100)]
... })
...
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "data_check_name": 'UniquenessDataCheck',
...                  "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}}]}

For multiclass, the column “regression_unique_enough” has too many unique values and will raise an appropriate warning.

>>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     'warnings': [{'message': "Input columns 'regression_unique_enough' for multiclass problem type are too unique.",
...                   'data_check_name': 'UniquenessDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['regression_unique_enough'],
...                               'rows': None,
...                               'uniqueness_score': {'regression_unique_enough': 0.99}},
...                   'code': 'TOO_UNIQUE'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'UniquenessDataCheck',
...                  'metadata': {'columns': ['regression_unique_enough'], 'rows': None}}]}

>>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
>>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625
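
The dictionaries returned by `validate` in the examples above list the affected columns both under `warnings` and under `actions`. One way to act on such a result is to collect the columns named by `DROP_COL` actions. The helper below is a hypothetical sketch that relies only on the keys visible in those examples; it is not part of the library.

```python
def columns_to_drop(result):
    """Collect column names from every DROP_COL action in a validate() result dict."""
    columns = []
    for action in result.get("actions", []):
        if action.get("code") == "DROP_COL":
            columns.extend(action["metadata"]["columns"])
    return columns


# Result copied from the multiclass example, with the warning details omitted.
example_result = {
    "warnings": [],
    "errors": [],
    "actions": [
        {
            "code": "DROP_COL",
            "data_check_name": "UniquenessDataCheck",
            "metadata": {"columns": ["regression_unique_enough"], "rows": None},
        }
    ],
}
to_drop = columns_to_drop(example_result)
```

The resulting list could then be passed to something like `df.drop(columns=to_drop)` before training.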

evalml.data_checks.uniqueness_data_check.warning_not_unique_enough = 'Input columns {} for {} problem type are not unique enough.'

evalml.data_checks.uniqueness_data_check.warning_too_unique = 'Input columns {} for {} problem type are too unique.'
