sparsity_data_check

Data check that checks if there are any columns with sparsely populated values in the input.

Module Contents

Classes Summary

SparsityDataCheck

Check if there are any columns with sparsely populated values in the input.

Attributes Summary

warning_too_unique

Contents

class evalml.data_checks.sparsity_data_check.SparsityDataCheck(problem_type, threshold, unique_count_threshold=10)[source]

Check if there are any columns with sparsely populated values in the input.

Parameters
  • problem_type (str or ProblemTypes) – The specific problem type to data check for. ‘multiclass’ or ‘time series multiclass’ is the only accepted problem type.

  • threshold (float) – The threshold value, or percentage of each column’s unique values, below which, a column exhibits sparsity. Should be between 0 and 1.

  • unique_count_threshold (int) – The minimum number of times a unique value has to be present in a column to not be considered “sparse.” Defaults to 10.

Methods

name

Return a name describing the data check.

sparsity_score

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.

validate

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

name(cls)

Return a name describing the data check.

static sparsity_score(col, count_threshold=10)[source]

Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold.

Parameters
  • col (pd.Series) – Feature values.

  • count_threshold (int) – The number of instances below which a value is considered sparse. Default is 10.

Returns

Sparsity score, or the percentage of the unique values that exceed count_threshold.

Return type

(float)

validate(self, X, y=None)[source]

Calculate what percentage of each column’s unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored.

Returns

dict with a DataCheckWarning if there are any sparse columns.

Return type

dict

Examples

>>> import pandas as pd

For multiclass problems, if a column doesn’t have enough representation from unique values, it will be considered sparse.

>>> df = pd.DataFrame({
...    "sparse": [float(x) for x in range(100)],
...    "not_sparse": [float(1) for x in range(100)]
... })
...
>>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10)
>>> assert sparsity_check.validate(df) == [
...     {
...         "message": "Input columns ('sparse') for multiclass problem type are too sparse.",
...         "data_check_name": "SparsityDataCheck",
...         "level": "warning",
...         "code": "TOO_SPARSE",
...         "details": {
...             "columns": ["sparse"],
...             "sparsity_score": {"sparse": 0.0},
...             "rows": None
...         },
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                  "data_check_name": "SparsityDataCheck",
...                  "parameters": {},
...                  "metadata": {"columns": ["sparse"], "rows": None}
...             }
...         ]
...     }
... ]

… >>> df[“sparse”] = [float(x % 10) for x in range(100)] >>> sparsity_check = SparsityDataCheck(problem_type=”multiclass”, threshold=1, unique_count_threshold=5) >>> assert sparsity_check.validate(df) == [] … >>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3) >>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666

evalml.data_checks.sparsity_data_check.warning_too_unique = Input columns ({}) for {} problem type are too sparse.