sparsity_data_check#

Data check that flags columns with sparsely populated values in the input.

Module Contents#

Classes Summary#

SparsityDataCheck

Check if there are any columns with sparsely populated values in the input.

Attributes Summary#

warning_too_unique

Contents#

class evalml.data_checks.sparsity_data_check.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]#

Check if there are any columns with sparsely populated values in the input.

Parameters

problem_type (str or ProblemTypes) – The specific problem type to data check for. ‘multiclass’ and ‘time series multiclass’ are the only accepted problem types.
threshold (float) – The sparsity threshold: the fraction of a column’s unique values that must exceed unique_count_threshold. A column whose fraction falls below this value is considered sparse. Should be between 0 and 1.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.

Methods

`name`	Return a name describing the data check.
`sparsity_score`	Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.
`validate`	Calculate what percentage of each column's unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

name(cls)#: Return a name describing the data check.

static sparsity_score(col, count_threshold=10)[source]#

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.

Parameters

col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.

Returns

Sparsity score, or the percentage of the unique values that exceed count_threshold.

Return type

(float)
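The score described above can be approximated with plain pandas. The following is a minimal sketch, not the library's implementation: the helper name `sparsity_score_sketch` is hypothetical, and it assumes that "exceed" means strictly greater than `count_threshold`, which agrees with the value shown in the Examples section.

```python
import pandas as pd


def sparsity_score_sketch(col: pd.Series, count_threshold: int = 10) -> float:
    """Hypothetical stand-in for SparsityDataCheck.sparsity_score."""
    # Count how many times each unique value occurs in the column.
    counts = col.value_counts()
    # Fraction of unique values whose count is strictly above the threshold
    # (assumption: "exceed" means >, not >=).
    return float((counts > count_threshold).sum() / counts.size)


# Value counts here are 1 -> 9, 2 -> 6, 3 -> 3, so two of three values exceed 5.
values = pd.Series([1, 1, 1, 2, 2, 3] * 3)
score = sparsity_score_sketch(values, count_threshold=5)
```

A score near 1 means almost every unique value is well represented; a score near 0 means most unique values appear only a handful of times.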

validate(self, X, y=None)[source]#

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.

Returns

A list containing a DataCheckWarning dict if there are any sparse columns; an empty list otherwise.

Return type

list
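The comparison that `validate` describes can be sketched with pandas alone. This is an illustration under assumptions, not evalml's code: `find_sparse_columns` is a hypothetical helper, a column is assumed to be flagged when its score is strictly below `threshold`, and a value is assumed to count as well represented only when it occurs strictly more than `unique_count_threshold` times.

```python
import pandas as pd


def find_sparse_columns(
    X: pd.DataFrame, threshold: float, unique_count_threshold: int = 10
) -> dict:
    """Hypothetical helper: map each sparse column name to its sparsity score."""
    sparse = {}
    for name in X.columns:
        counts = X[name].value_counts()
        # Fraction of unique values that occur more than unique_count_threshold times.
        score = float((counts > unique_count_threshold).sum() / counts.size)
        # Assumption: a column is sparse when its score is below the threshold.
        if score < threshold:
            sparse[name] = score
    return sparse


df = pd.DataFrame({
    "sparse": [float(x) for x in range(100)],  # 100 unique values, each seen once
    "not_sparse": [1.0] * 100,                 # one unique value, seen 100 times
})
flagged = find_sparse_columns(df, threshold=0.5, unique_count_threshold=10)
```

Here only the `sparse` column is flagged, because none of its unique values appears more than 10 times.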

Examples

>>> import pandas as pd
>>> from evalml.data_checks import SparsityDataCheck

For multiclass problems, if a column doesn’t have enough representation from unique values, it will be considered sparse.

>>> df = pd.DataFrame({
...    "sparse": [float(x) for x in range(100)],
...    "not_sparse": [float(1) for x in range(100)]
... })
...
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == [
...     {
...         "message": "Input columns ('sparse') for multiclass problem type are too sparse.",
...         "data_check_name": "SparsityDataCheck",
...         "level": "warning",
...         "code": "TOO_SPARSE",
...         "details": {
...             "columns": ["sparse"],
...             "sparsity_score": {"sparse": 0.0},
...             "rows": None
...         },
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "SparsityDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["sparse"], "rows": None}
...             }
...         ]
...     }
... ]

...
>>> df["sparse"] = [float(x % 10) for x in range(100)]
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=1, unique_count_threshold=5)
>>> assert sparsity_check.validate(df) == []
...
>>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3)
>>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666
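The warning shown in the first example carries its suggested fix in `action_options`. As a rough sketch, the columns named there can be collected and dropped with pandas. The `results` list below is a trimmed copy of that expected output rather than something produced by evalml, and `columns_to_drop` is a hypothetical helper, not part of the library.

```python
import pandas as pd


def columns_to_drop(results: list) -> list:
    """Hypothetical helper: gather columns from every DROP_COL action option."""
    cols = []
    for result in results:
        for action in result.get("action_options", []):
            if action["code"] == "DROP_COL":
                # "columns" may be None when an action targets rows instead.
                for col in action["metadata"]["columns"] or []:
                    if col not in cols:
                        cols.append(col)
    return cols


# Trimmed copy of the warning from the first example (not generated here).
results = [
    {
        "data_check_name": "SparsityDataCheck",
        "level": "warning",
        "code": "TOO_SPARSE",
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "SparsityDataCheck",
                "parameters": {},
                "metadata": {"columns": ["sparse"], "rows": None},
            }
        ],
    }
]

df = pd.DataFrame({
    "sparse": [float(x) for x in range(100)],
    "not_sparse": [1.0] * 100,
})
cleaned = df.drop(columns=columns_to_drop(results))
```

An empty result list yields no columns to drop, so the same code path leaves the DataFrame unchanged when nothing is sparse.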

evalml.data_checks.sparsity_data_check.warning_too_unique = 'Input columns ({}) for {} problem type are too sparse.'#