ts_parameters_data_check

Data check that checks whether the time series parameters are compatible with the data size.

Module Contents

Classes Summary

TimeSeriesParametersDataCheck – Checks whether the time series parameters are compatible with data splitting.

Contents

class evalml.data_checks.ts_parameters_data_check.TimeSeriesParametersDataCheck(problem_configuration, n_splits)[source]

Checks whether the time series parameters are compatible with data splitting.

If gap + max_delay + forecast_horizon >= X.shape[0] // (n_splits + 1),

then the feature engineering window is at least as large as the smallest split. The pipeline would then try to create features from data that does not exist, which causes errors.
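The comparison above can be sketched in plain Python. This is an illustrative helper only, not evalml's implementation: the function name `window_fits_smallest_split` is hypothetical, and it treats equality as incompatible, as the error message in the Examples section does (21 >= 20).

```python
def window_fits_smallest_split(n_obs, n_splits, gap, max_delay, forecast_horizon):
    """Return (is_compatible, window_size, min_split_size).

    Hypothetical helper: mirrors the documented comparison, not evalml's code.
    """
    # Rows the feature engineering step needs to look across.
    window_size = gap + max_delay + forecast_horizon
    # Size of the smallest split, per the documented formula.
    min_split_size = n_obs // (n_splits + 1)
    # Assumption: a window equal to the split size is also incompatible.
    return window_size < min_split_size, window_size, min_split_size


# Same numbers as the example in this section: 100 rows, 4 splits.
print(window_fits_smallest_split(100, 4, gap=7, max_delay=2, forecast_horizon=12))
# (False, 21, 20)
```

With n_splits=3 the smallest split grows to 25 observations, so the same window of 21 would fit.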

Parameters
  • problem_configuration (dict) – Dict containing problem_configuration parameters.

  • n_splits (int) – Number of time series splits.

Methods

  • name – Return a name describing the data check.

  • validate – Check if the time series parameters are compatible with data splitting.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if the time series parameters are compatible with data splitting.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A dict containing a DataCheckError in its "errors" list if the parameters are too large for the split sizes.

Return type

dict
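A minimal sketch of how the returned dict might be inspected. It assumes the key layout shown in the Examples for this method ("errors", "code", "details", "max_window_size", "min_split_size"); the helper name `describe_errors` and the trimmed `sample` dict are illustrative, not part of evalml.

```python
def describe_errors(results):
    """Collect (code, max_window_size, min_split_size) for each reported error.

    Hypothetical helper; works on any dict with the layout shown in the Examples.
    """
    summaries = []
    for error in results.get("errors", []):
        details = error.get("details") or {}
        summaries.append(
            (
                error.get("code"),
                details.get("max_window_size"),
                details.get("min_split_size"),
            )
        )
    return summaries


# Trimmed copy of the result from the Examples (message text omitted).
sample = {
    "warnings": [],
    "errors": [
        {
            "data_check_name": "TimeSeriesParametersDataCheck",
            "level": "error",
            "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
            "details": {
                "columns": None,
                "rows": None,
                "max_window_size": 21,
                "min_split_size": 20,
            },
        }
    ],
    "actions": [],
}

print(describe_errors(sample))
```

An empty "errors" list yields an empty summary, which a caller can treat as compatible parameters.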

Examples

>>> import pandas as pd
>>> from evalml.data_checks.ts_parameters_data_check import TimeSeriesParametersDataCheck

The time series parameters must be compatible with the data passed. If the window size (gap + max_delay + forecast_horizon) is greater than or equal to the smallest split size, validate reports an error in the "errors" list of the returned dict.

>>> X = pd.DataFrame({
...    'dates': pd.date_range("1/1/21", periods=100),
...    'first': list(range(100)),
... })
>>> y = pd.Series(range(100))
>>> problem_config = {"gap": 7, "max_delay": 2, "forecast_horizon": 12, "time_index": "dates"}
>>> ts_parameters_check = TimeSeriesParametersDataCheck(problem_configuration=problem_config, n_splits=4)
>>> assert ts_parameters_check.validate(X, y) == {
...     "warnings": [],
...     "errors": [{"message": "Since the data has 100 observations and n_splits=4, the smallest "
...                            "split would have 20 observations. Since 21 (gap + max_delay + forecast_horizon)"
...                            " >= 20, then at least one of the splits would be empty by the time it reaches "
...                            "the pipeline. Please use a smaller number of splits, reduce one or more these "
...                            "parameters, or collect more data.",
...                 "data_check_name": "TimeSeriesParametersDataCheck",
...                 "level": "error",
...                 "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
...                 "details": {'columns': None,
...                             'rows': None,
...                             'max_window_size': 21,
...                             'min_split_size': 20}}],
...     "actions": []}
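The error message suggests using a smaller number of splits. As a rough sketch, the largest n_splits that keeps the window strictly smaller than the smallest split can be found with a short loop. The function name `largest_compatible_n_splits` is hypothetical, and the split-size formula `n_obs // (n_splits + 1)` is taken from the class description rather than from evalml's source.

```python
def largest_compatible_n_splits(n_obs, window_size):
    """Return the largest n_splits >= 1 whose smallest split exceeds window_size.

    Hypothetical helper. Returns None if even a single split is too small.
    Assumes window_size is non-negative.
    """
    best = None
    n_splits = 1
    # The smallest split shrinks as n_splits grows, so stop at the first failure.
    while n_obs // (n_splits + 1) > window_size:
        best = n_splits
        n_splits += 1
    return best


# Example data: 100 observations, window of 7 + 2 + 12 = 21.
print(largest_compatible_n_splits(100, 21))
# 3
```

For the data above, n_splits=3 gives a smallest split of 25 observations, which exceeds the window of 21, while n_splits=4 gives 20 and triggers the error.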