sparsity_data_check¶

Data check that checks if there are any columns with sparsely populated values in the input.

Module Contents¶

Classes Summary¶

SparsityDataCheck

Check if there are any columns with sparsely populated values in the input.

Attributes Summary¶

warning_too_unique

Contents¶

class evalml.data_checks.sparsity_data_check.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]¶

Check if there are any columns with sparsely populated values in the input.

Parameters

problem_type (str or ProblemTypes) – The specific problem type to data check for. ‘multiclass’ or ‘time series multiclass’ is the only accepted problem type.
threshold (float) – The threshold value, or percentage of each column’s unique values, below which, a column exhibits sparsity. Should be between 0 and 1.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.

Methods

`name`	Return a name describing the data check.
`sparsity_score`	Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.
`validate`	Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

name(cls)¶: Return a name describing the data check.

static sparsity_score(col, count_threshold=10)[source]¶

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.

Parameters

col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.

Returns

Sparsity score, or the percentage of the unique values that exceed count_threshold.

Return type

(float)

validate(self, X, y=None)[source]¶

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.

Returns

dict with a DataCheckWarning if there are any sparse columns.

Return type

dict

Examples

>>> import pandas as pd

For multiclass problems, if a column doesn’t have enough representation from unique values, it will be considered sparse.

>>> df = pd.DataFrame({
...    "sparse": [float(x) for x in range(100)],
...    "not_sparse": [float(1) for x in range(100)]
... })
...
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == [
...     {
...         "message": "Input columns ('sparse') for multiclass problem type are too sparse.",
...         "data_check_name": "SparsityDataCheck",
...         "level": "warning",
...         "code": "TOO_SPARSE",
...         "details": {
...             "columns": ["sparse"],
...             "sparsity_score": {"sparse": 0.0},
...             "rows": None
...         },
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                  "data_check_name": "SparsityDataCheck",
...                  "parameters": {},
...                  "metadata": {"columns": ["sparse"], "rows": None}
...             }
...         ]
...     }
... ]

… >>> df[“sparse”] = [float(x % 10) for x in range(100)] >>> sparsity_check = SparsityDataCheck(problem_type=”multiclass”, threshold=1, unique_count_threshold=5) >>> assert sparsity_check.validate(df) == [] … >>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3) >>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666

evalml.data_checks.sparsity_data_check.warning_too_unique = Input columns ({}) for {} problem type are too sparse.¶

outliers_data_check

target_distribution_data_check