sparsity_data_check

Module Contents

Classes Summary

SparsityDataCheck

Checks if there are any columns with sparsely populated values in the input.

Attributes Summary

warning_too_unique

Contents

class evalml.data_checks.sparsity_data_check.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]

Checks if there are any columns with sparsely populated values in the input.

Parameters
  • problem_type (str or ProblemTypes) – The specific problem type to data check for. ‘multiclass’ and ‘time series multiclass’ are the only accepted problem types.

  • threshold (float) – The sparsity threshold. A column is considered sparse when the percentage of its unique values that exceed unique_count_threshold falls below this value. Should be between 0 and 1.

  • unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.

Methods

name

Returns a name describing the data check.

sparsity_score

Calculates a sparsity score for the given column as the percentage of unique values that exceed the count_threshold.

validate

Calculates what percentage of each column’s unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.

name(cls)

Returns a name describing the data check.

static sparsity_score(col, count_threshold=10)[source]

Calculates a sparsity score for the given column: the percentage of unique values whose counts exceed the count_threshold.

Parameters
  • col (pd.Series) – Feature values.

  • count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.

Returns

Sparsity score, or the percentage of the unique values that exceed count_threshold.

Return type

(float)
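
As a rough illustration of the score described above, the following sketch counts how often each unique value occurs and returns the share of unique values whose count exceeds the threshold. It is not the evalml implementation: the helper name sparsity_score_sketch is hypothetical, it takes a plain list rather than a pd.Series, and it assumes "exceed" means strictly greater than count_threshold.

```python
from collections import Counter


def sparsity_score_sketch(values, count_threshold=10):
    """Hypothetical helper: fraction of unique values seen more than count_threshold times."""
    counts = Counter(values)  # unique value -> number of occurrences
    if not counts:
        raise ValueError("values must not be empty")
    # Assumption: "exceed" is a strict comparison (count > count_threshold).
    frequent = sum(1 for n in counts.values() if n > count_threshold)
    return frequent / len(counts)


sparse = [float(x) for x in range(100)]  # 100 unique values, each seen once
not_sparse = [1.0] * 100                 # 1 unique value, seen 100 times

print(sparsity_score_sketch(sparse))      # 0.0
print(sparsity_score_sketch(not_sparse))  # 1.0
```

A score near 0 means most unique values are rare, while a score near 1 means most unique values are well represented.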

validate(self, X, y=None)[source]

Calculates what percentage of each column’s unique values exceed the count threshold and compares that percentage to the sparsity threshold stored in the class instance.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored.

Returns

dict with a DataCheckWarning if there are any sparse columns.

Return type

dict
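
The comparison that validate performs can be sketched roughly as follows. This is a simplified stand-in and not the library's code: find_sparse_columns is a hypothetical name, the input is a plain dict of lists instead of a pd.DataFrame, and it assumes a column is flagged when its score is strictly below threshold.

```python
from collections import Counter


def find_sparse_columns(columns, threshold, unique_count_threshold=10):
    """Hypothetical helper: map each sparse column name to its sparsity score."""
    flagged = {}
    for name, values in columns.items():
        counts = Counter(values)
        if not counts:
            continue  # skip empty columns; there is nothing to score
        # Share of unique values that occur more than unique_count_threshold times.
        score = sum(1 for n in counts.values() if n > unique_count_threshold) / len(counts)
        # Assumption: a column is sparse when its score is strictly below the threshold.
        if score < threshold:
            flagged[name] = score
    return flagged


data = {
    "sparse": [float(x) for x in range(100)],
    "not_sparse": [1.0] * 100,
}
flagged = find_sparse_columns(data, threshold=0.5, unique_count_threshold=10)
print(flagged)  # {'sparse': 0.0}
```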

Example

>>> import pandas as pd
>>> from evalml.data_checks import SparsityDataCheck
>>> df = pd.DataFrame({
...    'sparse': [float(x) for x in range(100)],
...    'not_sparse': [float(1) for x in range(100)]
... })
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Input columns (sparse) for multiclass problem type are too sparse.",
...                   "data_check_name": "SparsityDataCheck",
...                   "level": "warning",
...                   "code": "TOO_SPARSE",
...                   "details": {"column": "sparse", "sparsity_score": 0.0}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "sparse"}}]
... }
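
One possible way to consume a result of this shape is sketched below. The dict literal is copied from the example above so the snippet runs without evalml; the helpers columns_to_drop and warning_scores are hypothetical names, and the exact result layout may differ between evalml versions.

```python
# Result dict copied from the example above (assumed shape, not fetched from evalml).
result = {
    "errors": [],
    "warnings": [{"message": "Input columns (sparse) for multiclass problem type are too sparse.",
                  "data_check_name": "SparsityDataCheck",
                  "level": "warning",
                  "code": "TOO_SPARSE",
                  "details": {"column": "sparse", "sparsity_score": 0.0}}],
    "actions": [{"code": "DROP_COL",
                 "metadata": {"column": "sparse"}}],
}


def columns_to_drop(result):
    """Hypothetical helper: column names named by DROP_COL actions."""
    return [a["metadata"]["column"] for a in result["actions"] if a["code"] == "DROP_COL"]


def warning_scores(result):
    """Hypothetical helper: map each TOO_SPARSE column to its reported score."""
    return {w["details"]["column"]: w["details"]["sparsity_score"]
            for w in result["warnings"] if w["code"] == "TOO_SPARSE"}


print(columns_to_drop(result))  # ['sparse']
print(warning_scores(result))   # {'sparse': 0.0}
```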
evalml.data_checks.sparsity_data_check.warning_too_unique = Input columns ({}) for {} problem type are too sparse.