ts_parameters_data_check

Data check that checks whether the time series parameters are compatible with the data size.

Module Contents

Classes Summary

TimeSeriesParametersDataCheck

Checks whether the time series parameters are compatible with data splitting.

Contents

class evalml.data_checks.ts_parameters_data_check.TimeSeriesParametersDataCheck(problem_configuration, n_splits)

Checks whether the time series parameters are compatible with data splitting.

If gap + max_delay + forecast_horizon >= X.shape[0] // (n_splits + 1)

then the feature engineering window is at least as large as the smallest split. The pipeline would then try to create features from data that does not exist, which causes errors.
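
The condition can be sketched in plain Python. This is a minimal illustration and not evalml's implementation: the helper name `window_fits_split` is hypothetical, and the comparison is strict so that a window equal to the smallest split is rejected, matching the example on this page.

```python
def window_fits_split(n_obs, n_splits, gap, max_delay, forecast_horizon):
    """Hypothetical helper: return (is_valid, window_size, min_split_size)."""
    # Total number of rows the feature engineering window needs.
    window_size = gap + max_delay + forecast_horizon
    # Size of the smallest split: rows divided across n_splits + 1 chunks.
    min_split_size = n_obs // (n_splits + 1)
    # Valid only when the window is strictly smaller than the smallest split.
    return window_size < min_split_size, window_size, min_split_size


# Same numbers as the example on this page: 100 rows, 4 splits.
print(window_fits_split(100, 4, gap=7, max_delay=2, forecast_horizon=12))
# (False, 21, 20)
```

With 100 observations and n_splits=4, the smallest split has 20 rows while the window needs 21, so the parameters are rejected.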

Parameters

problem_configuration (dict) – Dict containing problem_configuration parameters.
n_splits (int) – Number of time series splits.

Methods

`name`	Return a name describing the data check.
`validate`	Check if the time series parameters are compatible with data splitting.

name(cls): Return a name describing the data check.

validate(self, X, y=None)

Check if the time series parameters are compatible with data splitting.

Parameters

X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A dict containing a DataCheckError if the parameters are too large for the split sizes.

Return type

dict

Examples

>>> import pandas as pd
>>> from evalml.data_checks.ts_parameters_data_check import TimeSeriesParametersDataCheck

The time series parameters have to be compatible with the data passed. If the window size (gap + max_delay + forecast_horizon) is greater than or equal to the smallest split size, validate returns an error in the "errors" list.

>>> X = pd.DataFrame({
...    'dates': pd.date_range("1/1/21", periods=100),
...    'first': [i for i in range(100)],
... })
>>> y = pd.Series([i for i in range(100)])
...
>>> problem_config = {"gap": 7, "max_delay": 2, "forecast_horizon": 12, "time_index": "dates"}
>>> ts_parameters_check = TimeSeriesParametersDataCheck(problem_configuration=problem_config, n_splits=4)
>>> assert ts_parameters_check.validate(X, y) == {
...     "warnings": [],
...     "errors": [{"message": "Since the data has 100 observations and n_splits=4, the smallest "
...                            "split would have 20 observations. Since 21 (gap + max_delay + forecast_horizon)"
...                            " >= 20, then at least one of the splits would be empty by the time it reaches "
...                            "the pipeline. Please use a smaller number of splits, reduce one or more these "
...                            "parameters, or collect more data.",
...                 "data_check_name": "TimeSeriesParametersDataCheck",
...                 "level": "error",
...                 "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
...                 "details": {'columns': None,
...                             'rows': None,
...                             'max_window_size': 21,
...                             'min_split_size': 20}}],
...     "actions": []}
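
The error message suggests using fewer splits. As a rough sketch of how one might choose that number, the hypothetical helper below (not part of evalml) searches for the largest n_splits whose smallest split is still strictly larger than the window. It assumes a non-negative window size and the same split-size formula, n_obs // (n_splits + 1), described above.

```python
def largest_valid_n_splits(n_obs, window_size):
    """Hypothetical helper: largest n_splits with
    n_obs // (n_splits + 1) > window_size, or None if no value works."""
    best = None
    n_splits = 1
    # The smallest split shrinks as n_splits grows, so stop at the first failure.
    while n_obs // (n_splits + 1) > window_size:
        best = n_splits
        n_splits += 1
    return best


# 100 rows with a window of 21 (gap=7, max_delay=2, forecast_horizon=12).
print(largest_valid_n_splits(100, 21))
# 3
```

With n_splits=3 the smallest split has 25 rows, which exceeds the window of 21; with n_splits=4 it drops to 20 and the check fails, as in the example above.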
