uniqueness_data_check

Data check that flags input columns that are either too unique for classification problems or not unique enough for regression problems.

Module Contents

Classes Summary

UniquenessDataCheck

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Attributes Summary

`warning_not_unique_enough`
`warning_too_unique`

Contents

class evalml.data_checks.uniqueness_data_check.UniquenessDataCheck(problem_type, threshold=0.5)

Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

Parameters

problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.
threshold (float) – The threshold used as an upper bound on uniqueness for classification problems or as a lower bound on uniqueness for regression problems. Defaults to 0.50.
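The direction of the threshold comparison can be sketched as follows. This is only an illustration: `flag_columns` is a hypothetical helper, not part of evalml, the problem type is treated as a plain string, and the strict `<` / `>` comparisons are an assumption rather than something this page states.

```python
from typing import Dict, List


def flag_columns(scores: Dict[str, float], problem_type: str, threshold: float = 0.5) -> List[str]:
    """Hypothetical helper: pick the columns a uniqueness check might warn about."""
    if "regression" in problem_type:
        # Regression: the threshold acts as a lower bound on uniqueness.
        return [name for name, score in scores.items() if score < threshold]
    # Anything else is treated as classification here (an assumption):
    # the threshold acts as an upper bound on uniqueness.
    return [name for name, score in scores.items() if score > threshold]


# Made-up uniqueness scores for three illustrative columns.
scores = {"row_id": 0.99, "category": 0.5, "constant": 0.0}
print(flag_columns(scores, "regression"))   # ['constant']
print(flag_columns(scores, "multiclass"))   # ['row_id']
```

With the default threshold of 0.5, a column scoring exactly 0.5 is flagged in neither case under these assumed strict comparisons.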

Methods

`name`	Return a name describing the data check.
`uniqueness_score`	Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
`validate`	Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

name(cls): Return a name describing the data check.

static uniqueness_score(col)

Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

Based on the Herfindahl–Hirschman Index.

Parameters: col (pd.Series) – Feature values.
Returns: Uniqueness score.
Return type: float
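A minimal sketch of a Herfindahl–Hirschman-style uniqueness score, assuming the score is one minus the sum of squared value frequencies. That assumption is consistent with the score of 0.0 reported for a constant column in the example on this page, but `hhi_uniqueness_score` is a hypothetical name and this is not necessarily how evalml implements it.

```python
import pandas as pd


def hhi_uniqueness_score(col: pd.Series) -> float:
    """Hypothetical helper: 1 minus the sum of squared value frequencies."""
    # value_counts() drops NaN by default, so missing values are not
    # counted as a distinct value.
    counts = col.value_counts()
    shares = counts / counts.sum()
    return float(1.0 - (shares ** 2).sum())


constant = pd.Series([1.0] * 100)
distinct = pd.Series([float(x) for x in range(100)])
print(hhi_uniqueness_score(constant))  # 0.0: a single value holds the whole share
print(hhi_uniqueness_score(distinct))  # approximately 0.99: 100 equal shares of 0.01
```

Under this definition the score approaches 1 as the values become more evenly spread over many distinct entries, and it is 0 when every non-missing value is identical.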

validate(self, X, y=None)

Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with a DataCheckWarning if any columns are too unique or not unique enough.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import UniquenessDataCheck
>>> df = pd.DataFrame({
...    'regression_unique_enough': [float(x) for x in range(100)],
...    'regression_not_unique_enough': [float(1) for x in range(100)]
... })
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (regression_not_unique_enough) for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"column": "regression_not_unique_enough", 'uniqueness_score': 0.0}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "regression_not_unique_enough"}}]}

evalml.data_checks.uniqueness_data_check.warning_not_unique_enough = 'Input columns ({}) for {} problem type are not unique enough.'

evalml.data_checks.uniqueness_data_check.warning_too_unique = 'Input columns ({}) for {} problem type are too unique.'
