uniqueness_data_check

Data check that checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Module Contents

Classes Summary

UniquenessDataCheck

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Contents

class evalml.data_checks.uniqueness_data_check.UniquenessDataCheck(problem_type, threshold=0.5)[source]

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Parameters
  • problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.

  • threshold (float) – The threshold to set as an upper bound on uniqueness for classification problems or as a lower bound for regression problems. Defaults to 0.50.

Methods

name

Return a name describing the data check.

uniqueness_score

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

validate

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

name(cls)

Return a name describing the data check.

static uniqueness_score(col, drop_na=True)[source]

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

Based on the Herfindahl–Hirschman Index.

Parameters
  • col (pd.Series) – Feature values.

  • drop_na (bool) – Whether to drop null values when computing the uniqueness score. Defaults to True.

Returns

Uniqueness score.

Return type

(float)
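
The Herfindahl–Hirschman Index is the sum of squared shares, so a uniqueness score built on it can be sketched as one minus the sum of squared value frequencies. The sketch below is a plain-Python illustration under that assumption, not evalml's implementation: the name `hhi_uniqueness`, the helper `_is_missing`, and the NaN result for empty input are all hypothetical choices made here. It reproduces the scores shown in the examples on this page (0.65625, 0.0, and approximately 0.99).

```python
from collections import Counter
import math


def _is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def hhi_uniqueness(values, drop_na=True):
    """One minus the sum of squared value frequencies (illustrative sketch)."""
    if drop_na:
        # Missing values are excluded from the calculation entirely.
        values = [v for v in values if not _is_missing(v)]
    else:
        # Assumption: all missing values are counted as one shared category.
        values = [None if _is_missing(v) else v for v in values]

    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # Assumption of this sketch: an empty column has no defined score.
        return float("nan")

    # Each share is the fraction of rows holding that value.
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


score = hhi_uniqueness([1, 1, 1, 2, 2, 3, 3, 3])
constant_score = hhi_uniqueness([1.0] * 100)
distinct_score = hhi_uniqueness([float(x) for x in range(100)])
```

A column with a single repeated value scores 0.0, and a column of n distinct values scores 1 - 1/n, so the score approaches 1 as values become more unique.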

validate(self, X, y=None)[source]

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckWarning if there are any columns that are too unique or not unique enough.

Return type

dict
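
The way the threshold acts as an upper bound for classification and a lower bound for regression can be sketched as a simple comparison per column. This is an illustration only: `flag_columns` and its `is_regression` flag are hypothetical names, and the use of strict inequalities is an assumption rather than documented behavior.

```python
def flag_columns(scores, threshold, is_regression):
    """Return the columns whose score falls on the wrong side of the threshold.

    Hypothetical helper for illustration; not part of evalml.
    """
    if is_regression:
        # Regression: the threshold is a lower bound on uniqueness.
        return [name for name, score in scores.items() if score < threshold]
    # Classification: the threshold is an upper bound on uniqueness.
    return [name for name, score in scores.items() if score > threshold]


# Scores taken from the examples on this page.
scores = {
    "regression_unique_enough": 0.99,
    "regression_not_unique_enough": 0.0,
}
not_unique_enough = flag_columns(scores, threshold=0.8, is_regression=True)
too_unique = flag_columns(scores, threshold=0.8, is_regression=False)
```

With a threshold of 0.8, the constant column is flagged under regression and the all-distinct column is flagged under classification, which matches the columns named in the warnings in the examples.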

Examples

>>> import pandas as pd
>>> from evalml.data_checks.uniqueness_data_check import UniquenessDataCheck

Because the problem type is regression, the column “regression_not_unique_enough” raises a warning for having just one value.

>>> df = pd.DataFrame({
...    'regression_unique_enough': [float(x) for x in range(100)],
...    'regression_not_unique_enough': [float(1) for x in range(100)]
... })
...
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "data_check_name": 'UniquenessDataCheck',
...                  "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}}]}

For multiclass, the column “regression_unique_enough” has too many unique values and will raise an appropriate warning.

>>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     'warnings': [{'message': "Input columns 'regression_unique_enough' for multiclass problem type are too unique.",
...                   'data_check_name': 'UniquenessDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['regression_unique_enough'],
...                               'rows': None,
...                               'uniqueness_score': {'regression_unique_enough': 0.99}},
...                   'code': 'TOO_UNIQUE'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'UniquenessDataCheck',
...                  'metadata': {'columns': ['regression_unique_enough'], 'rows': None}}]}

The uniqueness score can also be computed directly on a single series.

>>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
>>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625

evalml.data_checks.uniqueness_data_check.warning_not_unique_enough = Input columns {} for {} problem type are not unique enough.

evalml.data_checks.uniqueness_data_check.warning_too_unique = Input columns {} for {} problem type are too unique.
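
Both templates take two positional `{}` placeholders: the column names, then the problem type. As a rough sketch of how such a template could be filled to produce the message shown in the regression example, assuming column names are single-quoted and joined with commas (the joining rule for several columns is a guess, and `format_columns` is a hypothetical helper):

```python
# Template text copied from the module attribute above.
warning_not_unique_enough = (
    "Input columns {} for {} problem type are not unique enough."
)


def format_columns(columns):
    """Quote each column name and join with commas (assumed convention)."""
    return ", ".join(f"'{name}'" for name in columns)


message = warning_not_unique_enough.format(
    format_columns(["regression_not_unique_enough"]), "regression"
)
```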