ts_parameters_data_check

Data check that checks whether the time series parameters are compatible with the data size.

Module Contents

Classes Summary

TimeSeriesParametersDataCheck – Checks whether the time series parameters are compatible with data splitting.

Contents

class evalml.data_checks.ts_parameters_data_check.TimeSeriesParametersDataCheck(problem_configuration, n_splits)

Checks whether the time series parameters are compatible with data splitting.

If gap + max_delay + forecast_horizon >= X.shape[0] // (n_splits + 1), where X.shape[0] // (n_splits + 1) is the size of the smallest split, then the feature engineering window is at least as large as the smallest split. The pipeline would then try to create features from data that does not exist, which causes errors.
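The condition can be checked by hand before constructing the data check. The following is a minimal sketch, not evalml's implementation: the helper names are hypothetical, and it uses the `>=` comparison that appears in the error message of the Examples section.

```python
def smallest_split_size(n_obs: int, n_splits: int) -> int:
    """Smallest split size, computed as n_obs // (n_splits + 1)."""
    return n_obs // (n_splits + 1)


def window_size(gap: int, max_delay: int, forecast_horizon: int) -> int:
    """Feature engineering window: gap + max_delay + forecast_horizon."""
    return gap + max_delay + forecast_horizon


def parameters_fit(gap, max_delay, forecast_horizon, n_obs, n_splits) -> bool:
    """Return True when the window is strictly smaller than the smallest split."""
    window = window_size(gap, max_delay, forecast_horizon)
    return window < smallest_split_size(n_obs, n_splits)


# Values taken from the Examples section: 100 observations, n_splits=4.
print(window_size(7, 2, 12))             # 21
print(smallest_split_size(100, 4))       # 20
print(parameters_fit(7, 2, 12, 100, 4))  # False, because 21 >= 20
print(parameters_fit(7, 2, 12, 100, 3))  # True, because 21 < 25
```

With the same parameters, dropping from four splits to three raises the smallest split from 20 to 25 observations, which is enough to fit the 21-observation window.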

Parameters
  • problem_configuration (dict) – Dict containing the time series problem configuration, such as gap, max_delay, forecast_horizon, and time_index.

  • n_splits (int) – Number of time series splits.

Methods

  • name – Return a name describing the data check.

  • validate – Check if the time series parameters are compatible with data splitting.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)

Check if the time series parameters are compatible with data splitting.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A list containing a DataCheckError (as a dict) if the parameters are too big for the split sizes.

Return type

list of dict
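A caller will usually want to pick the errors out of the returned value and read their details. The sketch below assumes the result has the list-of-dicts shape shown in the Examples section. The `results` literal is copied from that example with the message shortened, and `blocking_errors` and `summarize` are hypothetical helpers, not part of evalml.

```python
# Hand-written stand-in for the value returned by validate(X, y);
# the keys mirror the Examples section, the message is shortened.
results = [
    {
        "message": "Window size is too large for the smallest split.",
        "data_check_name": "TimeSeriesParametersDataCheck",
        "level": "error",
        "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
        "details": {
            "columns": None,
            "rows": None,
            "max_window_size": 21,
            "min_split_size": 20,
            "n_obs": 100,
            "n_splits": 4,
        },
        "action_options": [],
    }
]


def blocking_errors(results):
    """Keep only the entries whose level is 'error'."""
    return [entry for entry in results if entry["level"] == "error"]


def summarize(entry):
    """Build a one-line summary from an entry's code and details."""
    details = entry["details"]
    return (
        f"{entry['code']}: window {details['max_window_size']} >= "
        f"smallest split {details['min_split_size']} "
        f"(n_obs={details['n_obs']}, n_splits={details['n_splits']})"
    )


for entry in blocking_errors(results):
    print(summarize(entry))
```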

Examples

>>> import pandas as pd
>>> from evalml.data_checks.ts_parameters_data_check import TimeSeriesParametersDataCheck

The time series parameters have to be compatible with the data passed. If the window size (gap + max_delay + forecast_horizon) is greater than or equal to the smallest split size, validate returns a DataCheckError.

>>> X = pd.DataFrame({
...    "dates": pd.date_range("1/1/21", periods=100),
...    "first": [i for i in range(100)],
... })
>>> y = pd.Series([i for i in range(100)])
...
>>> problem_config = {"gap": 7, "max_delay": 2, "forecast_horizon": 12, "time_index": "dates"}
>>> ts_parameters_check = TimeSeriesParametersDataCheck(problem_configuration=problem_config, n_splits=4)
>>> assert ts_parameters_check.validate(X, y) == [
...     {
...         "message": "Since the data has 100 observations and n_splits=4, the smallest "
...                    "split would have 20 observations. Since 21 (gap + max_delay + forecast_horizon)"
...                    " >= 20, then at least one of the splits would be empty by the time it reaches "
...                    "the pipeline. Please use a smaller number of splits, reduce one or more these "
...                    "parameters, or collect more data.",
...         "data_check_name": "TimeSeriesParametersDataCheck",
...         "level": "error",
...         "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
...         "details": {
...             "columns": None,
...             "rows": None,
...             "max_window_size": 21,
...             "min_split_size": 20,
...             "n_obs": 100,
...             "n_splits": 4
...         },
...         "action_options": []
...     }
... ]
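The error message suggests using a smaller number of splits. One way to find a workable value is to search downward for the largest n_splits whose smallest split still exceeds the window. This is a sketch under the assumption that the smallest split is n_obs // (n_splits + 1), as stated above. The function `largest_compatible_n_splits` and its `max_n_splits` cap are hypothetical and not part of evalml.

```python
def largest_compatible_n_splits(gap, max_delay, forecast_horizon, n_obs, max_n_splits=10):
    """Return the largest n_splits (up to max_n_splits) whose smallest split
    is strictly larger than the window, or None if no value works."""
    window = gap + max_delay + forecast_horizon
    for n_splits in range(max_n_splits, 0, -1):
        if n_obs // (n_splits + 1) > window:
            return n_splits
    return None


# Same parameters as the example above: window = 7 + 2 + 12 = 21.
print(largest_compatible_n_splits(7, 2, 12, n_obs=100))  # 3, since 100 // 4 = 25 > 21
print(largest_compatible_n_splits(7, 2, 12, n_obs=40))   # None, since 40 // 2 = 20 <= 21
```

When the function returns None, no number of splits fits, so the remaining options are the other two named in the message: reduce one of the parameters or collect more data.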