sparsity_data_check¶
Data check that checks if there are any columns with sparsely populated values in the input.
Module Contents¶
Classes Summary¶
SparsityDataCheck – Check if there are any columns with sparsely populated values in the input.
Attributes Summary¶
Contents¶
class evalml.data_checks.sparsity_data_check.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]¶

Check if there are any columns with sparsely populated values in the input.
- Parameters
problem_type (str or ProblemTypes) – The specific problem type to check the data for. ‘multiclass’ and ‘time series multiclass’ are the only accepted problem types.
threshold (float) – The threshold value, or percentage of each column’s unique values, below which a column is considered sparse. Should be between 0 and 1.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10. (See the sketch below for how these two thresholds interact.)
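As a rough paraphrase of these two parameters (a minimal sketch of the documented rule, not evalml’s actual implementation; the exact boundary handling may differ), a column is flagged when the fraction of its unique values occurring at least unique_count_threshold times falls below threshold:

import pandas as pd

def is_too_sparse(col: pd.Series, threshold: float, unique_count_threshold: int = 10) -> bool:
    # Hypothetical helper: sparsity score = fraction of the column's unique values
    # that occur at least `unique_count_threshold` times. (The real implementation
    # may use a strict ">" comparison instead of ">=".)
    counts = col.value_counts()
    score = (counts >= unique_count_threshold).sum() / len(counts)
    # A column "exhibits sparsity" when its score falls below `threshold`.
    return score < threshold

# 100 distinct values, each appearing once: flagged under these settings.
print(is_too_sparse(pd.Series(range(100), dtype=float), threshold=0.5))  # True
# A constant column has one unique value appearing 100 times: not flagged.
print(is_too_sparse(pd.Series([1.0] * 100), threshold=0.5))              # False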
Methods

name – Return a name describing the data check.
sparsity_score – Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.
validate – Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.
name(cls)¶

Return a name describing the data check.
static sparsity_score(col, count_threshold=10)[source]¶

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.
- Parameters
col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.
- Returns
Sparsity score, or the percentage of the unique values that exceed count_threshold.
- Return type
(float)
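For illustration, the sparsity_score example shown later in the Examples section can be reproduced with plain pandas value counts. This is a sketch of the scoring idea, not the library’s exact code; the counts here (9, 6, and 3) are chosen so that strict versus inclusive comparison against the threshold makes no difference to the result.

import pandas as pd

# Value 1 appears 9 times, value 2 appears 6 times, value 3 appears 3 times.
col = pd.Series([1, 1, 1, 2, 2, 3] * 3)
counts = col.value_counts()

# Share of unique values whose count exceeds count_threshold (here 5):
count_threshold = 5
score = (counts > count_threshold).sum() / len(counts)
print(score)  # 0.6666666666666666: two of the three unique values clear the threshold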
validate(self, X, y=None)[source]¶

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.
- Returns
list of DataCheckWarning messages (as dicts) if there are any sparse columns; an empty list otherwise.
- Return type
list
Examples
>>> import pandas as pd
>>> from evalml.data_checks.sparsity_data_check import SparsityDataCheck
For multiclass problems, if a column doesn’t have enough representation from unique values, it will be considered sparse.
>>> df = pd.DataFrame({
...     "sparse": [float(x) for x in range(100)],
...     "not_sparse": [float(1) for x in range(100)]
... })
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == [
...     {
...         "message": "Input columns ('sparse') for multiclass problem type are too sparse.",
...         "data_check_name": "SparsityDataCheck",
...         "level": "warning",
...         "code": "TOO_SPARSE",
...         "details": {
...             "columns": ["sparse"],
...             "sparsity_score": {"sparse": 0.0},
...             "rows": None
...         },
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "SparsityDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["sparse"], "rows": None}
...             }
...         ]
...     }
... ]
>>> df["sparse"] = [float(x % 10) for x in range(100)]
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=1, unique_count_threshold=5)
>>> assert sparsity_check.validate(df) == []
>>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3)
>>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666
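One way to act on the result is to read the flagged column names out of each returned message and drop them. This is just a usage sketch built on the dictionary structure shown above, not an evalml-prescribed workflow:

import pandas as pd
from evalml.data_checks.sparsity_data_check import SparsityDataCheck

df = pd.DataFrame({
    "sparse": [float(x) for x in range(100)],
    "not_sparse": [1.0] * 100,
})
check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)

# Collect every column named in a TOO_SPARSE warning and drop it from the frame.
flagged = []
for message in check.validate(df):
    flagged.extend(message["details"]["columns"])
cleaned = df.drop(columns=flagged)
print(cleaned.columns.tolist())  # ['not_sparse']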
evalml.data_checks.sparsity_data_check.warning_too_unique = Input columns ({}) for {} problem type are too sparse.¶