data_splitters

Data splitter classes.

Package Contents

Classes Summary

NoSplit

Does not split the training data into training and validation sets.

TimeSeriesSplit

Rolling Origin Cross Validation for time series problems.

TrainingValidationSplit

Split the training data into training and validation sets.

Contents

class evalml.preprocessing.data_splitters.NoSplit(random_seed=0)[source]

Does not split the training data into training and validation sets.

All data is passed as the training set; the test data is simply an array of None. This splitter is intended for future unsupervised learning and should not be used in any of the currently supported pipelines.

Parameters

random_seed (int) – The seed to use for random sampling. Defaults to 0. Not used.

Methods

get_n_splits

Return the number of splits of this object.

split

Divide the data into training and testing sets, where the testing set is empty.

static get_n_splits()[source]

Return the number of splits of this object.

Returns

Always returns 0.

Return type

int

split(self, X, y=None)[source]

Divide the data into training and testing sets, where the testing set is empty.

Parameters
  • X (pd.DataFrame) – Dataframe of points to split

  • y (pd.Series) – Series of points to split

Returns

Indices to split data into training and test set

Return type

list
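As a rough illustration of the behavior described above, the following sketch returns a single (train, test) pair in which every row index is in the training set and the test set is empty. The function name `no_split_indices` is hypothetical and is not part of evalml; representing the empty test set as an empty list is an assumption based on the description of `split`.

```python
def no_split_indices(n_rows):
    """Return one (train, test) pair: all rows for training, none for testing.

    Hypothetical helper for illustration only; not the evalml implementation.
    """
    train = list(range(n_rows))
    test = []  # assumption: the "empty" testing set is an empty list
    return [(train, test)]


pairs = no_split_indices(5)
train_idx, test_idx = pairs[0]
```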

class evalml.preprocessing.data_splitters.TimeSeriesSplit(max_delay=0, gap=0, forecast_horizon=1, time_index=None, n_splits=3)[source]

Rolling Origin Cross Validation for time series problems.

The max_delay, gap, and forecast_horizon parameters are only used to validate that the requested split size is not too small given these parameters.

Parameters
  • max_delay (int) – Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid “throwing out” more data than is necessary. Defaults to 0.

  • gap (int) – Number of time units separating the data used to generate features and the data to forecast on. Defaults to 0.

  • forecast_horizon (int) – Number of time units to forecast. Defaults to 1.

  • time_index (str) – Name of the column containing the datetime information used to order the data. Defaults to None.

  • n_splits (int) – Number of data splits to make. Defaults to 3.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing.data_splitters import TimeSeriesSplit
...
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> ts_split = TimeSeriesSplit(n_splits=4)
>>> generator_ = ts_split.split(X, y)
...
>>> first_split = next(generator_)
>>> assert (first_split[0] == np.array([0, 1])).all()
>>> assert (first_split[1] == np.array([2, 3])).all()
...
...
>>> second_split = next(generator_)
>>> assert (second_split[0] == np.array([0, 1, 2, 3])).all()
>>> assert (second_split[1] == np.array([4, 5])).all()
...
...
>>> third_split = next(generator_)
>>> assert (third_split[0] == np.array([0, 1, 2, 3, 4, 5])).all()
>>> assert (third_split[1] == np.array([6, 7])).all()
...
...
>>> fourth_split = next(generator_)
>>> assert (fourth_split[0] == np.array([0, 1, 2, 3, 4, 5, 6, 7])).all()
>>> assert (fourth_split[1] == np.array([8, 9])).all()
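The index pattern in the example above (an expanding training window followed by a fixed-size test window) can be reproduced with plain Python. This is a minimal sketch, not the evalml implementation: the function name `rolling_origin_splits` is hypothetical, and the fold-size rule `n_rows // (n_splits + 1)` is an assumption chosen because it reproduces the documented indices for 10 rows and `n_splits=4`. The sketch ignores `max_delay`, `gap`, and `forecast_horizon`.

```python
def rolling_origin_splits(n_rows, n_splits=3):
    """Yield (train, test) index lists with an expanding training window.

    Hypothetical helper for illustration only.
    """
    # Assumption: every test fold has the same size.
    test_size = n_rows // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too few rows for the requested number of splits.")
    first_test_start = n_rows - n_splits * test_size
    for test_start in range(first_test_start, n_rows, test_size):
        train = list(range(test_start))  # everything before the test fold
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(rolling_origin_splits(10, n_splits=4))
```

Each later split trains on all rows that precede its test window, so the origin "rolls" forward by one test window per split.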

Methods

get_n_splits

Get the number of data splits.

split

Get the time series splits.

get_n_splits(self, X=None, y=None, groups=None)[source]

Get the number of data splits.

Parameters
  • X (pd.DataFrame, None) – Features to split.

  • y (pd.DataFrame, None) – Target variable to split. Defaults to None.

  • groups – Ignored but kept for compatibility with sklearn API. Defaults to None.

Returns

Number of splits.

split(self, X, y=None, groups=None)[source]

Get the time series splits.

X and y are assumed to be sorted in ascending time order. Either X or y may be None or empty, but not both at the same time.

Parameters
  • X (pd.DataFrame, None) – Features to split.

  • y (pd.DataFrame, None) – Target variable to split. Defaults to None.

  • groups – Ignored but kept for compatibility with sklearn API. Defaults to None.

Yields

Iterator of (train, test) indices tuples.

Raises

ValueError – If one of the proposed splits would be empty.
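The rule that X and y may not both be None or empty can be pictured as a small guard that runs before any splitting. This is a sketch under stated assumptions: `choose_split_input` is a hypothetical name, falling back to y when X is empty is an assumption about how the split length is determined, and raising `ValueError` for the both-empty case is also an assumption.

```python
def choose_split_input(X, y):
    """Return the object whose length should drive the split.

    Hypothetical helper for illustration only; not the evalml implementation.
    """
    def is_empty(data):
        return data is None or len(data) == 0

    if is_empty(X) and is_empty(y):
        # Assumption: the both-empty case is reported as a ValueError.
        raise ValueError("X and y cannot both be None or empty.")
    # Assumption: fall back to y when X carries no rows.
    return y if is_empty(X) else X


from_y = choose_split_input(None, [10, 20, 30])
from_X = choose_split_input([[1], [2]], None)
```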

class evalml.preprocessing.data_splitters.TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)[source]

Split the training data into training and validation sets.

Parameters
  • test_size (float) – Proportion of data points to include in the validation set. Defaults to the complement of train_size if train_size is set, and 0.25 otherwise.

  • train_size (float) – Proportion of data points to include in the training set. Defaults to the complement of test_size.

  • shuffle (boolean) – Whether to shuffle the data before splitting. Defaults to False.

  • stratify (list) – Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.

  • random_seed (int) – The seed to use for random sampling. Defaults to 0.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing.data_splitters import TrainingValidationSplit
...
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> tv_split = TrainingValidationSplit()
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4, 5, 6])).all()
>>> assert (split_[1] == np.array([7, 8, 9])).all()
...
...
>>> tv_split = TrainingValidationSplit(test_size=0.5)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4])).all()
>>> assert (split_[1] == np.array([5, 6, 7, 8, 9])).all()
...
...
>>> tv_split = TrainingValidationSplit(shuffle=True)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([9, 1, 6, 7, 3, 0, 5])).all()
>>> assert (split_[1] == np.array([2, 8, 4])).all()
...
...
>>> y = pd.Series([i % 3 for i in range(10)])
>>> tv_split = TrainingValidationSplit(shuffle=True, stratify=y)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([1, 9, 3, 2, 8, 6, 7])).all()
>>> assert (split_[1] == np.array([0, 4, 5])).all()
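The first two examples above (no shuffling) follow from how test_size and train_size default. The sketch below works through that arithmetic in plain Python. It is not the evalml implementation: `plan_split` is a hypothetical name, rounding the validation count up is an assumption inferred from the 7/3 split shown for 10 rows, and the training set is assumed to take all remaining rows.

```python
import math


def plan_split(n_rows, test_size=None, train_size=None):
    """Return (train, test) index lists for an unshuffled split.

    Hypothetical helper for illustration only.
    """
    if test_size is None:
        # Complement of train_size when it is given, 0.25 otherwise.
        test_size = 1 - train_size if train_size is not None else 0.25
    # Assumption: the validation count is rounded up (2.5 rows becomes 3).
    n_test = math.ceil(test_size * n_rows)
    # Assumption: the training set takes every remaining row.
    n_train = n_rows - n_test
    return list(range(n_train)), list(range(n_train, n_rows))


default_train, default_test = plan_split(10)
half_train, half_test = plan_split(10, test_size=0.5)
```

With shuffle=True the same counts apply, but which rows land in each set depends on the random seed, so the sketch does not try to reproduce those indices.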

Methods

get_n_splits

Return the number of splits of this object.

split

Divide the data into training and testing sets.

static get_n_splits()[source]

Return the number of splits of this object.

Returns

Always returns 1.

Return type

int

split(self, X, y=None)[source]

Divide the data into training and testing sets.

Parameters
  • X (pd.DataFrame) – Dataframe of points to split

  • y (pd.Series) – Series of points to split

Returns

Indices to split data into training and test set

Return type

list