ts_splitting_data_check

Data check that checks whether the time series training and validation splits have adequate class representation.

Module Contents

Classes Summary

TimeSeriesSplittingDataCheck

Checks whether the time series target data is compatible with splitting.

Contents

class evalml.data_checks.ts_splitting_data_check.TimeSeriesSplittingDataCheck(problem_type, n_splits)

Checks whether the time series target data is compatible with splitting.

For time series classification problems, the target data in both the training and validation portions of every split must contain at least one instance of every class. If any split lacks a class, the estimators cannot train on all potential outcomes, which causes errors during prediction.
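As a rough illustration of this condition, the following pure-Python sketch flags splits whose training or validation rows are missing a class. The helper names `expanding_window_splits` and `find_invalid_splits` are hypothetical, and the equal-size expanding-window layout is an assumption inferred from the split indices quoted in the example on this page; it is not the library's implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_range, validation_range) pairs of row positions.

    Assumption: the data is cut into n_splits + 1 equal segments; each
    validation fold is one segment and the training window contains all
    rows before it.
    """
    fold_size = n_samples // (n_splits + 1)
    first_validation_start = n_samples - n_splits * fold_size
    for k in range(n_splits):
        start = first_validation_start + k * fold_size
        yield range(0, start), range(start, start + fold_size)


def find_invalid_splits(y, n_splits):
    """Return {split_number: [part names]} for splits that lack a class."""
    all_classes = set(y)
    invalid = {}
    splits = expanding_window_splits(len(y), n_splits)
    for split_number, (train, validation) in enumerate(splits, start=1):
        missing_parts = []
        if {y[i] for i in train} != all_classes:
            missing_parts.append("Training")
        if {y[i] for i in validation} != all_classes:
            missing_parts.append("Validation")
        if missing_parts:
            invalid[split_number] = missing_parts
    return invalid


# Toy target: the first rows are all class 0 and the last rows all class 1.
y = [0, 0, 0, 1, 0, 1, 1, 1]
print(find_invalid_splits(y, n_splits=3))
```

With this toy target, the first split's training rows and the third split's validation rows each contain a single class, so those two splits are flagged.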

Parameters

problem_type (str or ProblemTypes) – Problem type.
n_splits (int) – Number of time series splits.

Methods

`name`	Return a name describing the data check.
`validate`	Check if the training and validation targets are compatible with time series data splitting.

name(cls): Return a name describing the data check.

validate(self, X, y)

Check if the training and validation targets are compatible with time series data splitting.

Parameters

X (pd.DataFrame, np.ndarray) – Ignored. Features.
y (pd.Series, np.ndarray) – Target data.

Returns

A list containing a DataCheckError (serialized as a dict) if splitting would result in inadequate class representation.

Return type

list[dict]

Example

>>> import pandas as pd
>>> from evalml.data_checks.ts_splitting_data_check import TimeSeriesSplittingDataCheck

Passing n_splits as 3 segments the data into 4 parts, which are iterated over to form the training and validation splits. The first split uses training indices [0:25] and validation indices [25:50]; its training targets contain only one unique value (0). The third split uses training indices [0:75] and validation indices [75:100]; its validation targets contain only one unique value (1). Splits 1 and 3 are therefore reported as invalid.

>>> X = None
>>> y = pd.Series([0 if i < 45 else i % 2 if i < 55 else 1 for i in range(100)])
>>> ts_splitting_check = TimeSeriesSplittingDataCheck("time series binary", 3)
>>> assert ts_splitting_check.validate(X, y) == [
...     {
...         "message": "Time Series Binary and Time Series Multiclass problem "
...                    "types require every training and validation split to "
...                    "have at least one instance of all the target classes. "
...                    "The following splits are invalid: [1, 3]",
...         "data_check_name": "TimeSeriesSplittingDataCheck",
...         "level": "error",
...         "details": {
...             "columns": None, "rows": None,
...             "invalid_splits": {
...                 1: {"Training": [0, 25]},
...                 3: {"Validation": [75, 100]}
...             }
...         },
...         "code": "TIMESERIES_TARGET_NOT_COMPATIBLE_WITH_SPLIT",
...         "action_options": []
...     }
... ]
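Downstream code will often want to turn such a result into readable messages. The sketch below assumes the result has the list-of-dicts shape shown in the example above; the helper name `describe_invalid_splits` is hypothetical and not part of evalml, and the literal result is copied from that example (message text shortened) rather than produced by running the check.

```python
def describe_invalid_splits(results):
    """Build one readable line per invalid training or validation range."""
    lines = []
    for result in results:
        # Only error-level entries are of interest here.
        if result.get("level") != "error":
            continue
        invalid_splits = result["details"].get("invalid_splits", {})
        for split_number, parts in sorted(invalid_splits.items()):
            for part, (start, stop) in parts.items():
                lines.append(
                    f"split {split_number}: {part} rows [{start}:{stop}] "
                    "are missing at least one class"
                )
    return lines


# Result structure copied from the example above (message text shortened).
example_results = [
    {
        "message": "The following splits are invalid: [1, 3]",
        "data_check_name": "TimeSeriesSplittingDataCheck",
        "level": "error",
        "details": {
            "columns": None,
            "rows": None,
            "invalid_splits": {
                1: {"Training": [0, 25]},
                3: {"Validation": [75, 100]},
            },
        },
        "code": "TIMESERIES_TARGET_NOT_COMPATIBLE_WITH_SPLIT",
        "action_options": [],
    }
]

for line in describe_invalid_splits(example_results):
    print(line)
```

Reading the ranges from `details["invalid_splits"]` instead of parsing the message string keeps the handling independent of the exact message wording.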