ts_parameters_data_check
Data check that checks whether the time series parameters are compatible with the data size.
Module Contents
Classes Summary
TimeSeriesParametersDataCheck – Checks whether the time series parameters are compatible with data splitting.
Contents
class evalml.data_checks.ts_parameters_data_check.TimeSeriesParametersDataCheck(problem_configuration, n_splits)
Checks whether the time series parameters are compatible with data splitting.
If gap + max_delay + forecast_horizon >= X.shape[0] // (n_splits + 1), then the feature engineering window is at least as large as the smallest split. The pipeline would then try to create features from data that does not exist, which causes errors.
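The comparison described above can be sketched in plain Python. This is a minimal illustration, not evalml's implementation: the function names are hypothetical, and the sketch assumes the `>=` comparison reported in the error message of the example on this page.

```python
def window_size(gap: int, max_delay: int, forecast_horizon: int) -> int:
    """Feature engineering window: gap + max_delay + forecast_horizon."""
    return gap + max_delay + forecast_horizon


def smallest_split_size(n_obs: int, n_splits: int) -> int:
    """Size of the smallest split: n_obs // (n_splits + 1)."""
    return n_obs // (n_splits + 1)


def parameters_fit(n_obs: int, n_splits: int, gap: int, max_delay: int,
                   forecast_horizon: int) -> bool:
    """Hypothetical helper: True when the window is strictly smaller than the smallest split."""
    window = window_size(gap, max_delay, forecast_horizon)
    return window < smallest_split_size(n_obs, n_splits)


# Values taken from the example on this page: 100 rows, window of 7 + 2 + 12 = 21.
print(parameters_fit(100, 4, gap=7, max_delay=2, forecast_horizon=12))  # 100 // 5 = 20, so False
print(parameters_fit(100, 3, gap=7, max_delay=2, forecast_horizon=12))  # 100 // 4 = 25, so True
```

With the same data, dropping from four splits to three raises the smallest split from 20 to 25 observations, which is enough to hold the 21-observation window.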
- Parameters
problem_configuration (dict) – Dict containing the time series problem configuration, for example gap, max_delay, forecast_horizon, and time_index.
n_splits (int) – Number of time series splits.
Methods
name – Return a name describing the data check.
validate – Check if the time series parameters are compatible with data splitting.
name(cls)
Return a name describing the data check.
validate(self, X, y=None)
Check if the time series parameters are compatible with data splitting.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckError if parameters are too big for the split sizes.
- Return type
dict
Examples
>>> import pandas as pd
>>> from evalml.data_checks.ts_parameters_data_check import TimeSeriesParametersDataCheck

The time series parameters have to be compatible with the data passed. If the window size (gap + max_delay + forecast_horizon) is greater than or equal to the smallest split size, validate returns an error.

>>> X = pd.DataFrame({
...     'dates': pd.date_range("1/1/21", periods=100),
...     'first': list(range(100)),
... })
>>> y = pd.Series(list(range(100)))
>>> problem_config = {"gap": 7, "max_delay": 2, "forecast_horizon": 12, "time_index": "dates"}
>>> ts_parameters_check = TimeSeriesParametersDataCheck(problem_configuration=problem_config, n_splits=4)
>>> assert ts_parameters_check.validate(X, y) == {
...     "warnings": [],
...     "errors": [{"message": "Since the data has 100 observations and n_splits=4, the smallest "
...                            "split would have 20 observations. Since 21 (gap + max_delay + forecast_horizon)"
...                            " >= 20, then at least one of the splits would be empty by the time it reaches "
...                            "the pipeline. Please use a smaller number of splits, reduce one or more these "
...                            "parameters, or collect more data.",
...                 "data_check_name": "TimeSeriesParametersDataCheck",
...                 "level": "error",
...                 "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
...                 "details": {'columns': None,
...                             'rows': None,
...                             'max_window_size': 21,
...                             'min_split_size': 20}}],
...     "actions": []}
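The returned dict can be inspected to act on the error, for instance to follow the message's advice to use fewer splits. The sketch below works on an abbreviated copy of the result shown above and does not call evalml. The helper `largest_compatible_n_splits` is hypothetical, and the number of observations (100) is assumed to be known by the caller since it is not part of the details entry.

```python
from typing import Optional

# Abbreviated copy of the result from the example; the message text is omitted.
result = {
    "warnings": [],
    "errors": [
        {
            "data_check_name": "TimeSeriesParametersDataCheck",
            "level": "error",
            "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT",
            "details": {"columns": None, "rows": None,
                        "max_window_size": 21, "min_split_size": 20},
        }
    ],
    "actions": [],
}


def largest_compatible_n_splits(n_obs: int, window: int) -> Optional[int]:
    """Hypothetical helper: largest n_splits with n_obs // (n_splits + 1) > window, else None."""
    best = None
    n_splits = 1
    # The smallest split shrinks as n_splits grows, so stop at the first failure.
    while n_obs // (n_splits + 1) > window:
        best = n_splits
        n_splits += 1
    return best


n_obs = 100  # assumption: row count of X from the example
for error in result["errors"]:
    if error["code"] == "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT":
        window = error["details"]["max_window_size"]
        print("window size:", window)
        print("largest n_splits that fits:", largest_compatible_n_splits(n_obs, window))
```

A return value of None means no number of splits fits the window, in which case the remaining options from the message are to reduce the parameters or collect more data.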