sparsity_data_check
Module Contents
Classes Summary
SparsityDataCheck | Checks if there are any columns with sparsely populated values in the input.
Attributes Summary
Contents
class evalml.data_checks.sparsity_data_check.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)
Checks if there are any columns with sparsely populated values in the input.
- Parameters
problem_type (str or ProblemTypes) – The problem type to run the data check for. Only ‘multiclass’ and ‘time series multiclass’ are accepted.
threshold (float) – The sparsity threshold, between 0 and 1. A column exhibits sparsity when the fraction of its unique values that exceed unique_count_threshold falls below this value.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.
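Methods
name | Returns a name describing the data check.
sparsity_score | Calculates a sparsity score for the given value counts as the percentage of unique values that exceed the count_threshold.
validate | Calculates what percentage of each column’s unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.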
name(cls)
Returns a name describing the data check.
static sparsity_score(col, count_threshold=10)
Calculates a sparsity score for the given value counts as the percentage of unique values that exceed the count_threshold.
- Parameters
col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.
- Returns
Sparsity score: the fraction, between 0 and 1, of the unique values whose counts exceed count_threshold.
- Return type
float
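The calculation described above can be sketched with plain pandas. This is an illustrative approximation, not the library's source: the function name `sparsity_score_sketch` is hypothetical, and the strict `>` comparison is an assumption based on the word "exceed" in the description.

```python
import pandas as pd


def sparsity_score_sketch(col: pd.Series, count_threshold: int = 10) -> float:
    """Hypothetical helper: fraction of unique values seen more than count_threshold times.

    Assumes col contains at least one non-null value.
    """
    # Number of occurrences of each unique value (NaN is dropped by default).
    counts = col.value_counts()
    # Assumption: "exceed" means strictly greater than the threshold.
    return float((counts > count_threshold).sum() / counts.size)


all_unique = pd.Series([float(x) for x in range(100)])
constant = pd.Series([1.0] * 100)
print(sparsity_score_sketch(all_unique))  # every value appears once
print(sparsity_score_sketch(constant))    # the single value appears 100 times
```

Under these assumptions, a column of all-distinct values scores at the low end of the range and a constant column scores at the high end, which matches the `sparsity_score` of 0.0 reported for the `sparse` column in the documented example.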
validate(self, X, y=None)
Calculates what percentage of each column’s unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.
- Returns
dict with a DataCheckWarning if there are any sparse columns.
- Return type
dict
Example
>>> import pandas as pd
>>> from evalml.data_checks import SparsityDataCheck
>>> # 'sparse' has 100 distinct values; 'not_sparse' repeats one value 100 times.
>>> df = pd.DataFrame({
...     'sparse': [float(x) for x in range(100)],
...     'not_sparse': [float(1) for x in range(100)]
... })
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (sparse) for multiclass problem type are too sparse.",
...                   "data_check_name": "SparsityDataCheck",
...                   "level": "warning",
...                   "code": "TOO_SPARSE",
...                   "details": {"column": "sparse", "sparsity_score": 0.0}}],
...     "actions": [{"code": "DROP_COL", "metadata": {"column": "sparse"}}]
... }
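The per-column comparison that validate describes can be approximated without evalml. The sketch below is an assumption-laden illustration: `find_sparse_columns` is a hypothetical name, it returns a plain dict of scores rather than the warning/action structure shown in the example, and it assumes a column is flagged when its score is strictly below the threshold.

```python
import pandas as pd


def find_sparse_columns(X: pd.DataFrame, threshold: float,
                        unique_count_threshold: int = 10) -> dict:
    """Hypothetical helper: map each sparse column name to its sparsity score."""
    sparse = {}
    for name in X.columns:
        counts = X[name].value_counts()
        # Fraction of unique values seen more than unique_count_threshold times.
        score = float((counts > unique_count_threshold).sum() / counts.size)
        # Assumption: "below" the threshold means strictly less than.
        if score < threshold:
            sparse[name] = score
    return sparse


df = pd.DataFrame({
    "sparse": [float(x) for x in range(100)],
    "not_sparse": [1.0] * 100,
})
print(find_sparse_columns(df, threshold=0.5))
```

With the same data as the documented example, only the `sparse` column falls below the 0.5 threshold in this sketch.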
evalml.data_checks.sparsity_data_check.warning_too_unique = 'Input columns ({}) for {} problem type are too sparse.'
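The two `{}` placeholders in this template appear to be filled with the column name(s) and the problem type, judging by the warning message in the validate example. A minimal sketch, using a local copy of the string rather than importing it from evalml:

```python
# Local copy of the documented template; the variable name mirrors the module attribute.
warning_too_unique = "Input columns ({}) for {} problem type are too sparse."

# Assumption: first placeholder is the column name, second is the problem type.
message = warning_too_unique.format("sparse", "multiclass")
print(message)
```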