sparsity_data_check

Module Contents

Classes Summary

SparsityDataCheck

Checks if there are any columns with sparsely populated values in the input.

Attributes Summary

warning_too_unique

Contents

class evalml.data_checks.sparsity_data_check.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)

Checks if there are any columns with sparsely populated values in the input.

Parameters

problem_type (str or ProblemTypes) – The problem type to run the data check for. Only ‘multiclass’ and ‘time series multiclass’ are accepted.
threshold (float) – The sparsity threshold, between 0 and 1. A column is considered sparse when the percentage of its unique values that exceed unique_count_threshold falls below this value.
unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.

Methods

`name`	Returns a name describing the data check.
`sparsity_score`	Calculates a sparsity score for a column: the percentage of its unique values that exceed the count_threshold.
`validate`	Calculates what percentage of each column’s unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.

name(cls)

Returns a name describing the data check.

static sparsity_score(col, count_threshold=10)

Calculates a sparsity score for a column: the percentage of its unique values whose counts exceed count_threshold.

Parameters

col (pd.Series) – Feature values.
count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.

Returns

Sparsity score, or the percentage of the unique values that exceed count_threshold.

Return type

float
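
The calculation described above can be sketched without pandas or evalml. This is a minimal illustration, not the library's implementation: the name sparsity_score_sketch is hypothetical, "exceed" is assumed to mean a strict greater-than comparison, and missing values are not treated specially.

```python
from collections import Counter


def sparsity_score_sketch(values, count_threshold=10):
    """Return the fraction of distinct values that occur more than count_threshold times.

    Hypothetical helper for illustration only; it is not part of evalml.
    """
    counts = Counter(values)
    if not counts:
        raise ValueError("values must not be empty")
    # Count the distinct values whose frequency strictly exceeds the threshold.
    frequent = sum(1 for n in counts.values() if n > count_threshold)
    return frequent / len(counts)


# 100 distinct values, each seen once: no value exceeds the threshold.
all_unique = sparsity_score_sketch(range(100))
# A single value seen 100 times: every distinct value exceeds the threshold.
constant = sparsity_score_sketch([1.0] * 100)
print(all_unique, constant)
```

Under these assumptions, a low score means most distinct values are rare, which is the situation the data check flags.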

validate(self, X, y=None)

Calculates what percentage of each column’s unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored.

Returns

dict with a DataCheckWarning if there are any sparse columns.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks.sparsity_data_check import SparsityDataCheck
>>> df = pd.DataFrame({
...    'sparse': [float(x) for x in range(100)],
...    'not_sparse': [float(1) for x in range(100)]
... })
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (sparse) for multiclass problem type are too sparse.",
...                   "data_check_name": "SparsityDataCheck",
...                   "level": "warning",
...                   "code": "TOO_SPARSE",
...                   "details": {"column": "sparse", "sparsity_score": 0.0}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "sparse"}}]
... }
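
To make the comparison step concrete, here is a dependency-free sketch that reproduces the result dictionary from the example above. It is an approximation under stated assumptions, not evalml's code: validate_sketch and _score are hypothetical names, columns are passed as a plain dict of non-empty lists instead of a DataFrame, a column is flagged when its score is strictly below threshold, and one warning and one DROP_COL action are emitted per flagged column.

```python
from collections import Counter

# Message template as documented for this module.
WARNING_TEMPLATE = "Input columns ({}) for {} problem type are too sparse."


def _score(values, count_threshold):
    # Fraction of distinct values occurring more than count_threshold times.
    counts = Counter(values)
    return sum(1 for n in counts.values() if n > count_threshold) / len(counts)


def validate_sketch(columns, problem_type, threshold, unique_count_threshold=10):
    """Hypothetical stand-in for validate; columns maps names to non-empty lists."""
    results = {"warnings": [], "errors": [], "actions": []}
    for name, values in columns.items():
        score = _score(values, unique_count_threshold)
        if score < threshold:  # assumed: strictly below the threshold means sparse
            results["warnings"].append({
                "message": WARNING_TEMPLATE.format(name, problem_type),
                "data_check_name": "SparsityDataCheck",
                "level": "warning",
                "code": "TOO_SPARSE",
                "details": {"column": name, "sparsity_score": score},
            })
            results["actions"].append(
                {"code": "DROP_COL", "metadata": {"column": name}}
            )
    return results


columns = {
    "sparse": [float(x) for x in range(100)],
    "not_sparse": [1.0] * 100,
}
result = validate_sketch(columns, "multiclass", threshold=0.5)
print(result["warnings"][0]["message"])
```

Only the 'sparse' column is flagged here: its score is 0.0, which is below the 0.5 threshold, while 'not_sparse' scores 1.0.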

evalml.data_checks.sparsity_data_check.warning_too_unique = 'Input columns ({}) for {} problem type are too sparse.'
