time_series_split

Rolling Origin Cross Validation for time series problems.

Module Contents

Classes Summary

`TimeSeriesSplit`	Rolling Origin Cross Validation for time series problems.

Contents

class evalml.preprocessing.data_splitters.time_series_split.TimeSeriesSplit(max_delay=0, gap=0, forecast_horizon=None, time_index=None, n_series=None, n_splits=3)

Rolling Origin Cross Validation for time series problems.

The max_delay, gap, and forecast_horizon parameters are used to validate that the requested split size is not too small given these parameters. If forecast_horizon is an integer, it also sets the size of the cv splits.
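As a rough illustration of this kind of validation, the sketch below rejects a configuration whose splits would be too small. The helper name `check_split_size` and the rule it applies (each split holds `n_rows // (n_splits + 1)` rows and must exceed `max_delay + gap + forecast_horizon`) are assumptions made for illustration, not a description of the library's implementation.

```python
def check_split_size(n_rows, n_splits=3, max_delay=0, gap=0, forecast_horizon=1):
    """Hypothetical helper: raise if each split would be too small.

    Assumption (not taken from the library): each split holds
    n_rows // (n_splits + 1) rows and must be strictly larger than
    max_delay + gap + forecast_horizon.
    """
    split_size = n_rows // (n_splits + 1)
    required = max_delay + gap + forecast_horizon
    if split_size <= required:
        raise ValueError(
            f"Split size {split_size} is too small; need more than {required} rows."
        )
    return split_size


# 100 rows and 3 splits leave 25 rows per split, enough for 2 + 1 + 5 = 8.
split_size = check_split_size(100, n_splits=3, max_delay=2, gap=1, forecast_horizon=5)
```

Under these assumptions, 10 rows with `n_splits=4` and `max_delay=2` would be rejected, because each split would hold only 2 rows.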

Parameters

max_delay (int) – Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid “throwing out” more data than is necessary. Defaults to 0.
gap (int) – Number of time units separating the data used to generate features and the data to forecast on. Defaults to 0.
forecast_horizon (int, None) – Number of time units to forecast. Used for parameter validation. If an integer, will set the size of the cv splits. Defaults to None.
time_index (str) – Name of the column containing the datetime information used to order the data. Defaults to None.
n_splits (int) – Number of data splits to make. Defaults to 3.
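To see why delayed features leave the first max_delay rows incomplete, consider the plain-Python sketch below. It is only an illustration: the `lag` helper is hypothetical, and `None` stands in for the NaNs that a real feature-engineering step would produce.

```python
def lag(values, delay):
    """Hypothetical helper: shift values forward by `delay` rows, padding with None."""
    values = list(values)
    padded = [None] * delay + values
    return padded[: len(values)]


max_delay = 2
target = [10, 11, 12, 13, 14]

# One delayed feature per delay value from 1 to max_delay.
delayed = {f"lag_{d}": lag(target, d) for d in range(1, max_delay + 1)}

# Rows in which every delayed feature has a value.
usable_rows = [
    i
    for i in range(len(target))
    if all(column[i] is not None for column in delayed.values())
]
```

Here `usable_rows` starts at index 2, so the first `max_delay` rows carry missing values. Borrowing the last max_delay rows of the previous split gives the delayed features real values to look back on.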

Example

>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing.data_splitters.time_series_split import TimeSeriesSplit
...
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> ts_split = TimeSeriesSplit(n_splits=4)
>>> generator_ = ts_split.split(X, y)
...
>>> first_split = next(generator_)
>>> assert (first_split[0] == np.array([0, 1])).all()
>>> assert (first_split[1] == np.array([2, 3])).all()
...
...
>>> second_split = next(generator_)
>>> assert (second_split[0] == np.array([0, 1, 2, 3])).all()
>>> assert (second_split[1] == np.array([4, 5])).all()
...
...
>>> third_split = next(generator_)
>>> assert (third_split[0] == np.array([0, 1, 2, 3, 4, 5])).all()
>>> assert (third_split[1] == np.array([6, 7])).all()
...
...
>>> fourth_split = next(generator_)
>>> assert (fourth_split[0] == np.array([0, 1, 2, 3, 4, 5, 6, 7])).all()
>>> assert (fourth_split[1] == np.array([8, 9])).all()
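The expanding-window pattern in the example above can be reproduced without the library. The sketch below is a minimal plain-Python illustration; the function name `rolling_origin_splits` is hypothetical, and the test size of `n_rows // (n_splits + 1)` is an assumption chosen because it matches the example, not a statement about the library's internals.

```python
def rolling_origin_splits(n_rows, n_splits):
    """Yield (train, test) index lists with an expanding training window.

    Assumption: every test window holds n_rows // (n_splits + 1) rows and
    the last test window ends at the final row.
    """
    test_size = n_rows // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too few rows for the requested number of splits.")
    first_test_start = n_rows - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(rolling_origin_splits(n_rows=10, n_splits=4))
```

Each training window contains every row before its test window, so later splits train on strictly more history than earlier ones.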

Methods

`get_metadata_routing`	Get metadata routing of this object.
`get_n_splits`	Get the number of data splits.
`is_cv`	Returns whether or not the data splitter is a cross-validation data splitter.
`split`	Get the time series splits.

get_metadata_routing(self)#

Get metadata routing of this object.

See the scikit-learn User Guide for how the routing mechanism works.

Returns: routing – A MetadataRequest encapsulating routing information.
Return type: MetadataRequest

get_n_splits(self, X=None, y=None, groups=None)

Get the number of data splits.

Parameters

X (pd.DataFrame, None) – Features to split. Defaults to None.
y (pd.DataFrame, None) – Target variable to split. Defaults to None.
groups – Ignored but kept for compatibility with sklearn API. Defaults to None.

Returns: Number of splits.
Return type: int

property is_cv(self)#

Returns whether or not the data splitter is a cross-validation data splitter.

Returns: True if the splitter is a cross-validation data splitter, False otherwise.
Return type: bool

split(self, X, y=None, groups=None)

Get the time series splits.

X and y are assumed to be sorted in ascending time order. Either X or y may be empty or None, but not both at the same time.
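Because the ordering is assumed rather than enforced, it can be worth checking it before splitting. The sketch below uses plain Python lists rather than pandas objects; `is_ascending` and `sort_by_time` are hypothetical helpers written for illustration and are not part of the library.

```python
from datetime import date


def is_ascending(timestamps):
    """Hypothetical helper: True if the timestamps never decrease."""
    timestamps = list(timestamps)
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))


def sort_by_time(timestamps, rows):
    """Hypothetical helper: reorder rows so that their timestamps ascend."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    return [timestamps[i] for i in order], [rows[i] for i in order]


dates = [date(2023, 1, 3), date(2023, 1, 1), date(2023, 1, 2)]
rows = ["c", "a", "b"]

# Sort only when the data is out of order.
if not is_ascending(dates):
    dates, rows = sort_by_time(dates, rows)
```

With pandas data, the same effect would typically come from sorting X by the time_index column and reordering y to match before calling split.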

Parameters

X (pd.DataFrame, None) – Features to split.
y (pd.DataFrame, None) – Target variable to split. Defaults to None.
groups – Ignored but kept for compatibility with sklearn API. Defaults to None.

Yields

Iterator of (train, test) index tuples.

Raises

ValueError – If one of the proposed splits would be empty.