uniqueness_data_check
Data check that checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
Module Contents
Classes Summary
UniquenessDataCheck – Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
Attributes Summary
warning_not_unique_enough
warning_too_unique
Contents
class evalml.data_checks.uniqueness_data_check.UniquenessDataCheck(problem_type, threshold=0.5)
Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
- Parameters
problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.
threshold (float) – The uniqueness threshold, used as an upper bound for classification problems and as a lower bound for regression problems. Defaults to 0.50.
Methods
name – Return a name describing the data check.
uniqueness_score – Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
validate – Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.
name(cls)
Return a name describing the data check.
static uniqueness_score(col, drop_na=True)
Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
Based on the Herfindahl–Hirschman Index.
- Parameters
col (pd.Series) – Feature values.
drop_na (bool) – Whether to drop null values when computing the uniqueness score. Defaults to True.
- Returns
Uniqueness score.
- Return type
float
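The score can be illustrated with a small pure-Python sketch. It assumes the score is one minus the sum of squared relative frequencies of the distinct values, a form that is consistent with the example values on this page (0.0, 0.99, 0.65625) but is not stated explicitly here. The names `uniqueness_score_sketch` and `_is_missing` are hypothetical and are not part of evalml.

```python
import math
from collections import Counter

_MISSING = object()  # hypothetical sentinel that groups all missing values together


def _is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def uniqueness_score_sketch(values, drop_na=True):
    """Assumed formula: 1 - sum(p_i ** 2), with p_i the relative frequency of value i."""
    if drop_na:
        # Missing values are removed before counting.
        items = [v for v in values if not _is_missing(v)]
    else:
        # Missing values are kept and counted as a single category.
        items = [_MISSING if _is_missing(v) else v for v in values]
    total = len(items)
    counts = Counter(items)
    return 1 - sum((n / total) ** 2 for n in counts.values())


score = uniqueness_score_sketch([1, 1, 1, 2, 2, 3, 3, 3])
```

Under this assumption, a constant column scores 0.0 and a column of n distinct values scores 1 - 1/n, so the score approaches 1 as values become more unique.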
validate(self, X, y=None)
Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckWarning if any columns are too unique or not unique enough.
- Return type
dict
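The decision rule described above can be sketched as follows. This is an illustration under stated assumptions, not the evalml implementation: the score formula (one minus the sum of squared relative frequencies) and the strict comparisons against the threshold are assumptions, and `flag_columns_sketch` and `_score` are hypothetical names. Columns are passed as a plain dict of lists rather than a DataFrame, and missing values are not handled.

```python
from collections import Counter


def _score(values):
    """Assumed uniqueness score: 1 - sum of squared relative frequencies."""
    total = len(values)
    return 1 - sum((n / total) ** 2 for n in Counter(values).values())


def flag_columns_sketch(columns, problem_type, threshold=0.5):
    """Return {column name: score} for the columns that would be flagged."""
    scores = {name: _score(vals) for name, vals in columns.items()}
    if problem_type.endswith("regression"):
        # Regression: the threshold acts as a lower bound on uniqueness.
        return {name: s for name, s in scores.items() if s < threshold}
    # Classification: the threshold acts as an upper bound on uniqueness.
    return {name: s for name, s in scores.items() if s > threshold}


columns = {
    "regression_unique_enough": [float(x) for x in range(100)],
    "regression_not_unique_enough": [1.0] * 100,
}
flagged_regression = flag_columns_sketch(columns, "regression", threshold=0.8)
flagged_multiclass = flag_columns_sketch(columns, "multiclass", threshold=0.8)
```

With a threshold of 0.8, the constant column is flagged for regression and the all-distinct column is flagged for multiclass, matching the columns named in the warnings of the documented example.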
Examples
>>> import pandas as pd
>>> from evalml.data_checks import UniquenessDataCheck
>>> df = pd.DataFrame({
...     'regression_unique_enough': [float(x) for x in range(100)],
...     'regression_not_unique_enough': [float(1) for x in range(100)]
... })
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}}]}
>>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     'warnings': [{'message': "Input columns 'regression_unique_enough' for multiclass problem type are too unique.",
...                   'data_check_name': 'UniquenessDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['regression_unique_enough'],
...                               'rows': None,
...                               'uniqueness_score': {'regression_unique_enough': 0.99}},
...                   'code': 'TOO_UNIQUE'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'metadata': {'columns': ['regression_unique_enough'], 'rows': None}}]}
>>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
>>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625
evalml.data_checks.uniqueness_data_check.warning_not_unique_enough = Input columns {} for {} problem type are not unique enough.
evalml.data_checks.uniqueness_data_check.warning_too_unique = Input columns {} for {} problem type are too unique.
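Both attributes are message templates with two `{}` placeholders. Judging from the warning messages in the example above, the first placeholder takes the quoted column name and the second takes the problem type. The sketch below redefines the templates locally so it runs without evalml. It covers only the single-column case, since this page does not show how several column names are joined.

```python
# Local copies of the templates documented above (redefined here for illustration).
warning_not_unique_enough = "Input columns {} for {} problem type are not unique enough."
warning_too_unique = "Input columns {} for {} problem type are too unique."

# Single-column case, matching the messages in the documented example.
too_unique_message = warning_too_unique.format("'regression_unique_enough'", "multiclass")
not_unique_message = warning_not_unique_enough.format("'regression_not_unique_enough'", "regression")
```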