mismatched_series_length_data_check#

Data check that checks if one or more unique series in a multiseries dataset is a different length than the others.

Module Contents#

Classes Summary#

MismatchedSeriesLengthDataCheck

Check if one or more unique series in a multiseries dataset is of a different length than the others.

Contents#

class evalml.data_checks.mismatched_series_length_data_check.MismatchedSeriesLengthDataCheck(series_id)[source]#

Check if one or more unique series in a multiseries dataset is of a different length than the others.

Currently works specifically on stacked data.

Parameters

series_id (str) – The name of the series_id column for the dataset.

Methods

name

Return a name describing the data check.

validate

Check if one or more unique series in a multiseries dataset is of a different length than the others.

name(cls)#

Return a name describing the data check.

validate(self, X, y=None)[source]#

Check if one or more unique series in a multiseries dataset is of a different length than the others.

Currently works specifically on stacked data.

Parameters
  • X (pd.DataFrame, np.ndarray) – The input features to check. Must have a series_id column.

  • y (pd.Series) – The target. Defaults to None. Ignored.

Returns

A list containing a DataCheckWarning if any series in the dataset has a mismatched length, or a list containing a DataCheckError if the given series_id is not in the dataset.

Return type

list[dict] (DataCheckWarning, DataCheckError)

Examples

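>>> import pandas as pd
>>> from evalml.data_checks.mismatched_series_length_data_check import MismatchedSeriesLengthDataCheck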

For multiseries time series datasets, each series ID should ideally have the same number of datetime entries as every other series ID. If they don't, a warning is returned identifying which series IDs have mismatched lengths.

>>> X = pd.DataFrame(
...     {
...         "date": pd.date_range(start="1/1/2018", periods=20).repeat(5),
...         "series_id": pd.Series(list(range(5)) * 20, dtype="str"),
...         "feature_a": range(100),
...         "feature_b": reversed(range(100)),
...     },
... )
>>> X = X.drop(labels=0, axis=0)
>>> mismatched_series_length_check = MismatchedSeriesLengthDataCheck("series_id")
>>> assert mismatched_series_length_check.validate(X) == [
...      {
...         "message": "Series ID ['0'] do not match the majority length of the other series, which is 20",
...         "data_check_name": "MismatchedSeriesLengthDataCheck",
...         "level": "warning",
...         "details": {
...             "columns": None,
...             "rows": None,
...             "series_id": ['0'],
...             "majority_length": 20
...         },
...         "code": "MISMATCHED_SERIES_LENGTH",
...         "action_options": [],
...     }
... ]
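
To see why a series is flagged, it can help to look at the per-series row counts directly. The following is a minimal sketch in plain pandas, not evalml's internal implementation; the variable name `lengths` is illustrative only.

```python
import pandas as pd

# Rebuild the stacked example: 5 series with 20 dates each.
X = pd.DataFrame(
    {
        "date": pd.date_range(start="1/1/2018", periods=20).repeat(5),
        "series_id": pd.Series(list(range(5)) * 20, dtype="str"),
        "feature_a": range(100),
    }
)
# Drop the first row, which belongs to series "0".
X = X.drop(labels=0, axis=0)

# Count the rows belonging to each series ID (illustrative, not evalml code).
lengths = X.groupby("series_id").size()
print(lengths)
```

Series "0" ends up with 19 rows while the other four keep 20, which matches the warning in the example above.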

If MismatchedSeriesLengthDataCheck is given a series_id column name that is not in the dataset, an error is returned.

>>> X = pd.DataFrame(
...     {
...         "date": pd.date_range(start="1/1/2018", periods=20).repeat(5),
...         "series_id": pd.Series(list(range(5)) * 20, dtype="str"),
...         "feature_a": range(100),
...         "feature_b": reversed(range(100)),
...     },
... )
>>> X = X.drop(labels=0, axis=0)
>>> mismatched_series_length_check = MismatchedSeriesLengthDataCheck("not_series_id")
>>> assert mismatched_series_length_check.validate(X) == [
...      {
...         "message": "series_id 'not_series_id' is not in the dataset.",
...         "data_check_name": "MismatchedSeriesLengthDataCheck",
...         "level": "error",
...         "details": {
...             "columns": None,
...             "rows": None,
...             "series_id": "not_series_id",
...         },
...         "code": "INVALID_SERIES_ID_COL",
...         "action_options": [],
...     }
... ]

If there are multiple lengths that have the same number of series (e.g. two series have length 20 and two series have length 19), this data check will consider the higher length to be the majority length (e.g. in that case, length 20 would be the majority length).

>>> X = pd.DataFrame(
...     {
...         "date": pd.date_range(start="1/1/2018", periods=20).repeat(4),
...         "series_id": pd.Series(list(range(4)) * 20, dtype="str"),
...         "feature_a": range(80),
...         "feature_b": reversed(range(80)),
...     },
... )
>>> X = X.drop(labels=[0, 1], axis=0)
>>> mismatched_series_length_check = MismatchedSeriesLengthDataCheck("series_id")
>>> assert mismatched_series_length_check.validate(X) == [
...      {
...         "message": "Series ID ['0', '1'] do not match the majority length of the other series, which is 20",
...         "data_check_name": "MismatchedSeriesLengthDataCheck",
...         "level": "warning",
...         "details": {
...             "columns": None,
...             "rows": None,
...             "series_id": ['0', '1'],
...             "majority_length": 20
...         },
...         "code": "MISMATCHED_SERIES_LENGTH",
...         "action_options": [],
...     }
... ]
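
The tie-breaking rule described above can be sketched with the standard library alone. This is an illustration of the stated behavior under the assumption that ties go to the larger length. The helper `majority_length` is hypothetical and is not part of evalml.

```python
from collections import Counter


def majority_length(series_lengths):
    """Return the most common series length; ties go to the larger length.

    Hypothetical helper for illustration only, not an evalml function.
    """
    counts = Counter(series_lengths.values())
    # Rank each length by (number of series with that length, the length itself),
    # so equal counts are resolved in favor of the higher length.
    return max(counts, key=lambda length: (counts[length], length))


# Two series of length 19 and two of length 20, as in the tie described above.
lengths = {"0": 19, "1": 19, "2": 20, "3": 20}
majority = majority_length(lengths)
mismatched = [sid for sid, n in lengths.items() if n != majority]
print(majority, mismatched)
```

With these inputs the majority length is 20 and series "0" and "1" are the mismatched ones, consistent with the warning described for the tie case.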