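uniqueness_data_check
Module Contents
Classes Summary
UniquenessDataCheck – Checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
Attributes Summary
warning_not_unique_enough
warning_too_unique
Contents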
class evalml.data_checks.uniqueness_data_check.UniquenessDataCheck(problem_type, threshold=0.5)
Checks if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.
- Parameters
problem_type (str or ProblemTypes) – The problem type to run the data check for, e.g. ‘binary’, ‘multiclass’, ‘regression’, ‘time series regression’.
threshold (float) – The uniqueness score threshold: an upper bound for classification problems, or a lower bound for regression problems. Defaults to 0.50.
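Methods
name – Returns a name describing the data check.
uniqueness_score – Calculates a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
validate – Checks if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.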
name(cls)
Returns a name describing the data check.
static uniqueness_score(col)
Calculates a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.
Based on the Herfindahl–Hirschman Index.
- Parameters
col (pd.Series) – Feature values.
- Returns
Uniqueness score.
- Return type
(float)
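As a rough illustration of the description above, the sketch below computes a Herfindahl–Hirschman-style score in plain Python. It assumes the score is one minus the sum of squared value frequencies, with missing values dropped before counting. The helper name `hhi_uniqueness` is hypothetical, and this is not necessarily how the library computes its score.

```python
import math
from collections import Counter


def hhi_uniqueness(values):
    """Return 1 - sum(p_i ** 2) over the observed value frequencies p_i.

    Hypothetical helper: None and NaN entries are dropped before counting.
    """
    kept = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not kept:
        return 0.0  # assumption: an empty column is treated as not unique
    counts = Counter(kept)
    total = len(kept)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# A constant column has a single frequency of 1, so the score is 0.0.
print(hhi_uniqueness([1.0] * 100))
# 100 distinct values each have frequency 0.01, giving a score near 0.99.
print(hhi_uniqueness(range(100)))
```

Under this assumption, a column whose values are all identical scores 0.0, and the score approaches 1 as the values become more evenly spread over many distinct entries.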
validate(self, X, y=None)
Checks if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
A dict with a DataCheckWarning if any columns are too unique or not unique enough.
- Return type
dict
Example
>>> import pandas as pd
>>> from evalml.data_checks import UniquenessDataCheck
>>> df = pd.DataFrame({
...     'regression_unique_enough': [float(x) for x in range(100)],
...     'regression_not_unique_enough': [float(1) for x in range(100)]
... })
>>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
>>> assert uniqueness_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (regression_not_unique_enough) for regression problem type are not unique enough.",
...                   "data_check_name": "UniquenessDataCheck",
...                   "level": "warning",
...                   "code": "NOT_UNIQUE_ENOUGH",
...                   "details": {"column": "regression_not_unique_enough", 'uniqueness_score': 0.0}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "regression_not_unique_enough"}}]}
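The threshold semantics described under Parameters can be sketched as a small decision rule. This is an illustration under stated assumptions rather than the library's code: the helper `is_flagged` is hypothetical, the comparisons are assumed to be strict, and any problem type whose name contains "regression" is assumed to count as a regression problem.

```python
def is_flagged(score, problem_type, threshold=0.5):
    """Hypothetical rule: does a column's uniqueness score trigger a warning?"""
    if "regression" in problem_type:
        # Regression: the threshold acts as a lower bound on uniqueness.
        return score < threshold
    # Classification: the threshold acts as an upper bound on uniqueness.
    return score > threshold


# A constant column (score 0.0) falls below a regression threshold of 0.8.
print(is_flagged(0.0, "regression", threshold=0.8))  # True
# A nearly all-distinct column (score 0.99) exceeds a classification threshold of 0.5.
print(is_flagged(0.99, "binary"))  # True
```

The same score can therefore be acceptable for one problem type and flagged for another, which is why `problem_type` is a required argument.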
evalml.data_checks.uniqueness_data_check.warning_not_unique_enough = 'Input columns ({}) for {} problem type are not unique enough.'
evalml.data_checks.uniqueness_data_check.warning_too_unique = 'Input columns ({}) for {} problem type are too unique.'
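Both attributes are message templates with two `{}` placeholders. The sketch below assumes they are filled positionally with `str.format`, column name(s) first and problem type second, which is consistent with the warning message shown in the `validate` example. How several column names would be joined into the first placeholder is not stated here, so only a single column is shown.

```python
# Templates copied from the module attributes documented above.
warning_not_unique_enough = "Input columns ({}) for {} problem type are not unique enough."
warning_too_unique = "Input columns ({}) for {} problem type are too unique."

# Assumed argument order: column name(s), then problem type.
message = warning_not_unique_enough.format("regression_not_unique_enough", "regression")
print(message)
```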