sparsity_data_check
===================

.. py:module:: evalml.data_checks.sparsity_data_check

.. autoapi-nested-parse::

   Data check that checks if there are any columns with sparsely populated values in the input.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.sparsity_data_check.SparsityDataCheck


Attributes Summary
~~~~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.sparsity_data_check.warning_too_unique


Contents
~~~~~~~~

.. py:class:: SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)

   Check if there are any columns with sparsely populated values in the input.

   :param problem_type: The specific problem type to data check for. 'multiclass' and 'time series multiclass' are the only accepted problem types.
   :type problem_type: str or ProblemTypes
   :param threshold: The threshold value, or percentage of each column's unique values, below which a column exhibits sparsity. Should be between 0 and 1.
   :type threshold: float
   :param unique_count_threshold: The minimum number of times a unique value has to be present in a column to not be considered "sparse." Defaults to 10.
   :type unique_count_threshold: int

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.sparsity_data_check.SparsityDataCheck.name
      evalml.data_checks.sparsity_data_check.SparsityDataCheck.sparsity_score
      evalml.data_checks.sparsity_data_check.SparsityDataCheck.validate

   .. py:method:: name(cls)

      Return a name describing the data check.

   .. py:method:: sparsity_score(col, count_threshold=10)
      :staticmethod:

      Calculate a sparsity score for the given column: the percentage of its unique values whose counts exceed the count_threshold.

      :param col: Feature values.
      :type col: pd.Series
      :param count_threshold: The number of instances below which a value is considered sparse. Defaults to 10.
      :type count_threshold: int

      :returns: Sparsity score, or the percentage of the unique values that exceed count_threshold.
      :rtype: (float)

   .. py:method:: validate(self, X, y=None)

      Calculate what percentage of each column's unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

      :param X: Features.
      :type X: pd.DataFrame, np.ndarray
      :param y: Ignored.
      :type y: pd.Series, np.ndarray

      :returns: list containing a DataCheckWarning for each column that is too sparse, or an empty list if there are none.
      :rtype: list

      .. rubric:: Examples

      >>> import pandas as pd

      For multiclass problems, if a column doesn't have enough representation from unique values, it will be considered sparse.

      >>> df = pd.DataFrame({
      ...     "sparse": [float(x) for x in range(100)],
      ...     "not_sparse": [float(1) for x in range(100)]
      ... })
      ...
      >>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
      >>> assert sparsity_check.validate(df) == [
      ...     {
      ...         "message": "Input columns ('sparse') for multiclass problem type are too sparse.",
      ...         "data_check_name": "SparsityDataCheck",
      ...         "level": "warning",
      ...         "code": "TOO_SPARSE",
      ...         "details": {
      ...             "columns": ["sparse"],
      ...             "sparsity_score": {"sparse": 0.0},
      ...             "rows": None
      ...         },
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "SparsityDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["sparse"], "rows": None}
      ...             }
      ...         ]
      ...     }
      ... ]
      ...
      >>> df["sparse"] = [float(x % 10) for x in range(100)]
      >>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=1, unique_count_threshold=5)
      >>> assert sparsity_check.validate(df) == []
      ...
      >>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3)
      >>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666

.. py:data:: warning_too_unique
   :annotation: = Input columns ({}) for {} problem type are too sparse.
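The scoring behavior described above can be sketched in a few lines. This is a minimal illustration of the documented semantics (fraction of a column's unique values that occur at least ``count_threshold`` times), not evalml's actual implementation; the helper name ``sparsity_score_sketch`` is invented for this example.

```python
import pandas as pd

def sparsity_score_sketch(col, count_threshold=10):
    """Sketch of the documented sparsity score: the fraction of a
    column's unique values that appear at least count_threshold times.
    (Illustrative only; not evalml's implementation.)"""
    counts = col.value_counts()
    return float((counts >= count_threshold).sum() / counts.size)

# Mirrors the doctest above: 1 appears 9 times, 2 appears 6 times,
# and 3 appears only 3 times, so 2 of the 3 unique values clear a
# threshold of 5, giving a score of 2/3.
sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3)
print(sparsity_score_sketch(sparse_array, count_threshold=5))
```

A score of 0 means no unique value is well represented (every value is "sparse"), while a score of 1 means every unique value meets the count threshold; ``validate`` flags columns whose score falls below the configured ``threshold``.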