training_validation_split

Training Validation Split class.

Module Contents

Classes Summary

TrainingValidationSplit

Split the training data into training and validation sets.

Contents

class evalml.preprocessing.data_splitters.training_validation_split.TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)

Split the training data into training and validation sets.

Parameters
  • test_size (float) – Fraction of data points to include in the validation set (for example, 0.25 for 25%). Defaults to the complement of train_size if train_size is set, and 0.25 otherwise.

  • train_size (float) – Fraction of data points to include in the training set. Defaults to the complement of test_size.

  • shuffle (boolean) – Whether to shuffle the data before splitting. Defaults to False.

  • stratify (list) – Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.

  • random_seed (int) – The seed to use for random sampling. Defaults to 0.
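
The way the test_size and train_size defaults interact can be sketched in plain Python. The helper resolve_split_sizes below is hypothetical and not part of evalml. It assumes the validation count is rounded up and the training count is rounded down, which is consistent with the 7/3 split of 10 rows in the Examples section but is not confirmed by this page.

```python
import math


def resolve_split_sizes(n_rows, test_size=None, train_size=None):
    """Hypothetical helper: turn the documented fractions into row counts."""
    if test_size is None and train_size is None:
        # Documented default when neither size is given.
        test_size = 0.25
    if test_size is not None:
        # Assumption: round the validation count up.
        n_test = math.ceil(n_rows * test_size)
        if train_size is None:
            # train_size defaults to the complement of test_size.
            n_train = n_rows - n_test
        else:
            # Assumption: round the training count down.
            n_train = math.floor(n_rows * train_size)
    else:
        # Only train_size given: test_size is its complement.
        n_train = math.floor(n_rows * train_size)
        n_test = n_rows - n_train
    return n_train, n_test


print(resolve_split_sizes(10))                  # (7, 3)
print(resolve_split_sizes(10, test_size=0.5))   # (5, 5)
print(resolve_split_sizes(10, train_size=0.8))  # (8, 2)
```

Taking the complement on row counts rather than on the fractions avoids floating-point surprises such as 1 - 0.7 not being exactly 0.3.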

Examples

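>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing.data_splitters.training_validation_split import TrainingValidationSplit
...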
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> tv_split = TrainingValidationSplit()
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4, 5, 6])).all()
>>> assert (split_[1] == np.array([7, 8, 9])).all()
...
...
>>> tv_split = TrainingValidationSplit(test_size=0.5)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4])).all()
>>> assert (split_[1] == np.array([5, 6, 7, 8, 9])).all()
...
...
>>> tv_split = TrainingValidationSplit(shuffle=True)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([9, 1, 6, 7, 3, 0, 5])).all()
>>> assert (split_[1] == np.array([2, 8, 4])).all()
...
...
>>> y = pd.Series([i % 3 for i in range(10)])
>>> tv_split = TrainingValidationSplit(shuffle=True, stratify=y)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([1, 9, 3, 2, 8, 6, 7])).all()
>>> assert (split_[1] == np.array([0, 4, 5])).all()
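
To see what the stratified split preserves, the class counts on each side can be tallied by hand. The snippet below is a plain-Python check, not evalml code. It copies the labels and the index arrays from the last example above and only counts them.

```python
from collections import Counter

# Labels and indices copied from the stratified example above.
y = [i % 3 for i in range(10)]
train_idx = [1, 9, 3, 2, 8, 6, 7]
valid_idx = [0, 4, 5]

# Count how often each class label appears on each side of the split.
train_counts = Counter(y[i] for i in train_idx)
valid_counts = Counter(y[i] for i in valid_idx)

print(sorted(train_counts.items()))  # [(0, 3), (1, 2), (2, 2)]
print(sorted(valid_counts.items()))  # [(0, 1), (1, 1), (2, 1)]
```

Each of the three classes appears once in the validation set, and the training set keeps the remaining rows in roughly the original 4:3:3 proportion.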

Methods

get_metadata_routing

Get metadata routing of this object.

get_n_splits

Return the number of splits of this object.

is_cv

Returns whether or not the data splitter is a cross-validation data splitter.

split

Divide the data into training and testing sets.

get_metadata_routing(self)

Get metadata routing of this object.

Please check the User Guide for how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

static get_n_splits()

Return the number of splits of this object.

Returns

Always returns 1.

Return type

int

property is_cv(self)

Returns whether or not the data splitter is a cross-validation data splitter.

Returns

True if the splitter is a cross-validation data splitter, False otherwise.

Return type

bool

split(self, X, y=None)

Divide the data into training and testing sets.

Parameters
  • X (pd.DataFrame) – DataFrame of points to split.

  • y (pd.Series) – Series of points to split.

Returns

Indices for splitting the data into training and test sets.

Return type

list
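
The examples above consume the result of split with next(), and get_n_splits always returns 1. A minimal stand-in with the same interface can make that usage pattern concrete. SingleSplitSketch below is a hypothetical class, not the evalml implementation. It assumes an unshuffled split, a validation count that is rounded up, and that is_cv is False for a single train/validation split.

```python
import math


class SingleSplitSketch:
    """Hypothetical stand-in: one unshuffled train/validation split."""

    def __init__(self, test_size=0.25):
        self.test_size = test_size

    @staticmethod
    def get_n_splits():
        # Mirrors the documented behaviour: always one split.
        return 1

    @property
    def is_cv(self):
        # Assumption: a single split is not cross-validation.
        return False

    def split(self, X, y=None):
        n_rows = len(X)
        # Assumption: round the validation count up.
        n_valid = math.ceil(n_rows * self.test_size)
        cut = n_rows - n_valid
        indices = list(range(n_rows))
        # An iterator holding exactly one (train, validation) pair.
        return iter([(indices[:cut], indices[cut:])])


splitter = SingleSplitSketch()
train_idx, valid_idx = next(splitter.split(list("abcdefghij")))
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6]
print(valid_idx)  # [7, 8, 9]
```

Returning an iterator of pairs lets a single-split object be used wherever a multi-fold splitter is expected, since the caller can loop over the result in both cases.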