uniqueness_data_check
==================================================

.. py:module:: evalml.data_checks.uniqueness_data_check

.. autoapi-nested-parse::

   Data check that flags columns in the input that are either too unique for classification problems or not unique enough for regression problems.



Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.uniqueness_data_check.UniquenessDataCheck


Attributes Summary
~~~~~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.uniqueness_data_check.warning_not_unique_enough
   evalml.data_checks.uniqueness_data_check.warning_too_unique


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: UniquenessDataCheck(problem_type, threshold=0.5)

   Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems.

   :param problem_type: The specific problem type to run the data check for, e.g. 'binary', 'multiclass', 'regression', 'time series regression'.
   :type problem_type: str or ProblemTypes
   :param threshold: The threshold to set as an upper bound on uniqueness for classification type problems or as a lower bound for regression type problems. Defaults to 0.50.
   :type threshold: float

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.uniqueness_data_check.UniquenessDataCheck.name
      evalml.data_checks.uniqueness_data_check.UniquenessDataCheck.uniqueness_score
      evalml.data_checks.uniqueness_data_check.UniquenessDataCheck.validate

   .. py:method:: name(cls)

      Return a name describing the data check.

   .. py:method:: uniqueness_score(col, drop_na=True)
      :staticmethod:

      Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation.

      Based on the Herfindahl-Hirschman Index.

      :param col: Feature values.
      :type col: pd.Series
      :param drop_na: Whether to drop null values when computing the uniqueness score. Defaults to True.
      :type drop_na: bool

      :returns: Uniqueness score.
      :rtype: float

   .. py:method:: validate(self, X, y=None)

      Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems.

      :param X: Features.
      :type X: pd.DataFrame, np.ndarray
      :param y: Ignored. Defaults to None.
      :type y: pd.Series, np.ndarray

      :returns: dict with a DataCheckWarning if there are any too unique or not unique enough columns.
      :rtype: dict

      .. rubric:: Examples

      >>> import pandas as pd

      Because the problem type is regression, the column "regression_not_unique_enough" raises a warning for having just one value.

      >>> df = pd.DataFrame({
      ...     "regression_unique_enough": [float(x) for x in range(100)],
      ...     "regression_not_unique_enough": [float(1) for x in range(100)]
      ... })
      ...
      >>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8)
      >>> assert uniqueness_check.validate(df) == [
      ...     {
      ...         "message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.",
      ...         "data_check_name": "UniquenessDataCheck",
      ...         "level": "warning",
      ...         "code": "NOT_UNIQUE_ENOUGH",
      ...         "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None},
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "parameters": {},
      ...                 "data_check_name": "UniquenessDataCheck",
      ...                 "metadata": {"columns": ["regression_not_unique_enough"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]

      For multiclass, the column "regression_unique_enough" has too many unique values and will raise an appropriate warning.

      >>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3])
      >>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8)
      >>> assert uniqueness_check.validate(df) == [
      ...     {
      ...         "message": "Input columns 'regression_unique_enough' for multiclass problem type are too unique.",
      ...         "data_check_name": "UniquenessDataCheck",
      ...         "level": "warning",
      ...         "details": {
      ...             "columns": ["regression_unique_enough"],
      ...             "rows": None,
      ...             "uniqueness_score": {"regression_unique_enough": 0.99}
      ...         },
      ...         "code": "TOO_UNIQUE",
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "UniquenessDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["regression_unique_enough"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]
      ...
      >>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625


.. py:data:: warning_not_unique_enough
   :annotation: = Input columns {} for {} problem type are not unique enough.


.. py:data:: warning_too_unique
   :annotation: = Input columns {} for {} problem type are too unique.
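

The score returned by ``uniqueness_score`` and the threshold comparison described for ``UniquenessDataCheck`` can be illustrated without evalml. The sketch below is not evalml's implementation. It assumes that the score is one minus the sum of squared value frequencies (a Herfindahl-Hirschman style index), that regression columns are flagged when the score is strictly below the threshold, and that multiclass columns are flagged when it is strictly above. The names ``hhi_uniqueness`` and ``flag_columns`` are hypothetical, and plain lists stand in for ``pd.Series`` columns.

.. code-block:: python

   import math
   from collections import Counter


   def _is_missing(value):
       # Treat None and float NaN as missing values.
       return value is None or (isinstance(value, float) and math.isnan(value))


   def hhi_uniqueness(values, drop_na=True):
       """Hypothetical helper: 1 minus the sum of squared value frequencies."""
       # Map every missing value to None so that all of them share one bucket.
       cleaned = [None if _is_missing(v) else v for v in values]
       if drop_na:
           cleaned = [v for v in cleaned if v is not None]
       total = len(cleaned)
       counts = Counter(cleaned)
       return 1.0 - sum((n / total) ** 2 for n in counts.values())


   def flag_columns(columns, problem_type, threshold=0.5):
       """Hypothetical helper: names of columns that would trigger a warning.

       Assumption: strict comparisons, and only the two problem types
       used in the examples on this page are handled.
       """
       flagged = []
       for name, values in columns.items():
           score = hhi_uniqueness(values)
           if problem_type == "regression" and score < threshold:
               flagged.append(name)  # not unique enough
           elif problem_type == "multiclass" and score > threshold:
               flagged.append(name)  # too unique
       return flagged


   columns = {
       "regression_unique_enough": [float(x) for x in range(100)],
       "regression_not_unique_enough": [1.0] * 100,
   }

   print(hhi_uniqueness([1, 1, 1, 2, 2, 3, 3, 3]))            # 0.65625
   print(flag_columns(columns, "regression", threshold=0.8))  # ['regression_not_unique_enough']
   print(flag_columns(columns, "multiclass", threshold=0.8))  # ['regression_unique_enough']

Under these assumptions a constant column scores 0.0 and a column of 100 distinct values scores approximately 0.99, which agrees with the ``uniqueness_score`` values reported in the ``validate`` examples.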