data_splitters

Package Contents

Classes Summary

BalancedClassificationSampler

Downsampler that balances class counts for classification problems.

SamplerBase

Base class for all custom samplers.

TimeSeriesSplit

Rolling Origin Cross Validation for time series problems.

TrainingValidationSplit

Split the training data into training and validation sets.

Contents

class evalml.preprocessing.data_splitters.BalancedClassificationSampler(sampling_ratio=0.25, sampling_ratio_dict=None, min_samples=100, min_percentage=0.1, random_seed=0)[source]

Downsampler that balances class counts for classification problems.

Parameters
  • sampling_ratio (float) – The smallest minority:majority ratio that is accepted as ‘balanced’. For instance, a 1:4 ratio would be represented as 0.25, while a 1:1 ratio is 1.0. Must be between 0 and 1, inclusive. Defaults to 0.25.

  • sampling_ratio_dict (dict) – A dictionary specifying the desired balanced ratio for each target value. Overrides sampling_ratio if provided. Defaults to None.

  • min_samples (int) – The minimum number of samples that we must have for any class, pre or post sampling. If a class must be downsampled, it will not be downsampled past this value. To determine severe imbalance, the minority class must occur less often than this and must have a class ratio below min_percentage. Must be greater than 0. Defaults to 100.

  • min_percentage (float) – The smallest fraction of the total dataset that the minority class may make up and still be tolerated, as long as it is above min_samples. To determine severe imbalance, the minority class must have a class ratio below this and must occur less often than min_samples. Must be between 0 and 0.5, inclusive. Defaults to 0.1.

  • random_seed (int) – The seed to use for random sampling. Defaults to 0.
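The interaction between sampling_ratio, min_samples, and min_percentage can be sketched in plain Python. This is a rough illustration of the parameter descriptions above, not evalml's implementation; the helper names downsample_targets and is_severely_imbalanced are hypothetical, and the sketch assumes a sampling_ratio strictly greater than 0.

```python
def downsample_targets(class_counts, sampling_ratio=0.25, min_samples=100):
    """Return how many rows to keep per class (illustrative sketch only)."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("this sketch requires 0 < sampling_ratio <= 1")
    minority = min(class_counts.values())
    # Largest class size that still satisfies minority / size >= sampling_ratio.
    balanced_cap = int(minority / sampling_ratio)
    # A class is never downsampled below min_samples.
    goal = max(balanced_cap, min_samples)
    # Classes already at or below the goal are left untouched.
    return {label: min(count, goal) for label, count in class_counts.items()}


def is_severely_imbalanced(class_counts, min_samples=100, min_percentage=0.1):
    """Both conditions from the parameter descriptions must hold."""
    total = sum(class_counts.values())
    minority = min(class_counts.values())
    return minority < min_samples and minority / total < min_percentage


# A 1:10 dataset is reduced to the 1:4 ratio that sampling_ratio=0.25 accepts.
print(downsample_targets({"majority": 1000, "minority": 100}))
# → {'majority': 400, 'minority': 100}
```

With a very small minority class, such as 10 rows against 1000, the min_samples floor takes over and the majority class is kept at 100 rows rather than 40.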

Methods

fit_resample

Resampling technique for this sampler.

fit_resample(self, X, y)[source]

Resampling technique for this sampler.

Parameters
  • X (pd.DataFrame) – Training data to fit and resample

  • y (pd.Series) – Training data targets to fit and resample

Returns

Indices to keep for training data

Return type

list

class evalml.preprocessing.data_splitters.SamplerBase(random_seed=0)[source]

Base class for all custom samplers.

Parameters

random_seed (int) – The seed to use for random sampling. Defaults to 0.

Methods

fit_resample

Resample the input data with this sampling strategy.

abstract fit_resample(self, X, y)[source]

Resample the input data with this sampling strategy.

Parameters
  • X (pd.DataFrame) – Training data to fit and resample

  • y (pd.Series) – Training data targets to fit and resample

Returns

resampled X and y data for oversampling or indices to keep for undersampling

Return type

Tuple(pd.DataFrame, pd.Series) or list
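As a rough sketch of the contract described above, the following defines a stand-in abstract base class and one undersampling subclass that returns indices to keep. SamplerBaseSketch and KeepHalfSampler are hypothetical names used only for illustration; they mirror the documented constructor and fit_resample signature but are not evalml classes, and plain lists stand in for the pandas inputs.

```python
import random
from abc import ABC, abstractmethod


class SamplerBaseSketch(ABC):
    """Stand-in for the documented interface (not the evalml class)."""

    def __init__(self, random_seed=0):
        self.random_seed = random_seed

    @abstractmethod
    def fit_resample(self, X, y):
        """Return resampled (X, y) for oversampling, or indices for undersampling."""


class KeepHalfSampler(SamplerBaseSketch):
    """Toy undersampler: keeps a seeded random half of the rows."""

    def fit_resample(self, X, y):
        rng = random.Random(self.random_seed)
        n_rows = len(y)
        # Undersampling variant: return the row positions to keep, in order.
        return sorted(rng.sample(range(n_rows), n_rows // 2))


rows = list(range(10))
indices = KeepHalfSampler(random_seed=0).fit_resample(rows, rows)
kept_rows = [rows[i] for i in indices]
```

Seeding a local random.Random instance keeps the result reproducible for a given random_seed without touching global random state.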

class evalml.preprocessing.data_splitters.TimeSeriesSplit(max_delay=0, gap=0, date_index=None, n_splits=3)[source]

Rolling Origin Cross Validation for time series problems.

This class uses max_delay and gap values to account for the feature and target engineering that evalml time series pipelines perform, e.g., delaying input features and shifting the target variable by the desired amount. If the data to be split already has all the features and appropriate target values, set max_delay and gap to 0.

Parameters
  • max_delay (int) – Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid “throwing out” more data than is necessary. Defaults to 0.

  • gap (int) – Gap used in time series problem. Time series pipelines shift the target variable by gap rows. Defaults to 0.

  • date_index (str) – Name of the column containing the datetime information used to order the data. Defaults to None.

  • n_splits (int) – Number of data splits to make. Defaults to 3.
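The effect of max_delay on a rolling origin split can be illustrated with a small pure-Python sketch. It assumes equally sized test folds and leaves out gap handling for brevity; the function name rolling_origin_splits is hypothetical, and this is not evalml's implementation.

```python
def rolling_origin_splits(n_rows, n_splits=3, max_delay=0):
    """Yield (train, test) index lists for rolling origin cross validation.

    Each test fold is extended backwards by max_delay rows, so the rows that
    delayed-feature engineering fills with NaNs come from the preceding data
    rather than from the fold being evaluated.
    """
    test_size = n_rows // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough rows for the requested number of splits")
    first_test_start = n_rows - n_splits * test_size
    for test_start in range(first_test_start, n_rows, test_size):
        # Training data always starts at the origin and grows with each split.
        train = list(range(0, test_start))
        # Reuse the last max_delay training rows at the start of the test fold.
        test = list(range(max(0, test_start - max_delay), test_start + test_size))
        yield train, test


for train, test in rolling_origin_splits(n_rows=12, n_splits=3, max_delay=2):
    print(train[-1], test)
# → 2 [1, 2, 3, 4, 5]
# → 5 [4, 5, 6, 7, 8]
# → 8 [7, 8, 9, 10, 11]
```

In each split the first two test indices overlap the end of the training range, which is the reuse of max_delay rows described for the max_delay parameter.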

Methods

get_n_splits

Get the number of data splits.

split

Get the time series splits.

get_n_splits(self, X=None, y=None, groups=None)[source]

Get the number of data splits.

split(self, X, y=None, groups=None)[source]

Get the time series splits.

X and y are assumed to be sorted in ascending time order. Either X or y may be empty or None, but not both at the same time.

Parameters
  • X (pd.DataFrame, None) – Features to split.

  • y (pd.DataFrame, None) – Target variable to split.

  • groups – Ignored but kept for compatibility with sklearn api.

Returns

Iterator of (train, test) indices tuples.

class evalml.preprocessing.data_splitters.TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)[source]

Split the training data into training and validation sets.

Parameters
  • test_size (float) – What percentage of data points should be included in the validation set. Defaults to the complement of train_size if train_size is set, and 0.25 otherwise.

  • train_size (float) – What percentage of data points should be included in the training set. Defaults to the complement of test_size.

  • shuffle (boolean) – Whether to shuffle the data before splitting. Defaults to False.

  • stratify (list) – Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.

  • random_seed (int) – The seed to use for random sampling. Defaults to 0.
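The defaulting rules for test_size and train_size can be sketched as follows. The helper names resolve_sizes and unshuffled_split are hypothetical, and rounding the validation count up is an assumption made for this illustration rather than documented behavior.

```python
import math


def resolve_sizes(test_size=None, train_size=None):
    """Apply the documented defaults: complement of the other size, else 0.25."""
    if test_size is None:
        test_size = 1.0 - train_size if train_size is not None else 0.25
    if train_size is None:
        train_size = 1.0 - test_size
    return train_size, test_size


def unshuffled_split(n_rows, test_size=None, train_size=None):
    """Split row positions without shuffling: training rows first, then validation."""
    _, test_fraction = resolve_sizes(test_size, train_size)
    # Assumption for this sketch: the validation count is rounded up.
    n_test = math.ceil(test_fraction * n_rows)
    n_train = n_rows - n_test
    return list(range(n_train)), list(range(n_train, n_rows))


train_idx, valid_idx = unshuffled_split(10)
print(len(train_idx), len(valid_idx))
# → 7 3
```

With neither size given, a quarter of the rows (rounded up here) go to validation and the rest to training.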

Methods

get_n_splits

Returns the number of splits of this object

split

Divides the data into training and testing sets

static get_n_splits()[source]

Returns the number of splits of this object

split(self, X, y=None)[source]

Divides the data into training and testing sets

Parameters
  • X (pd.DataFrame) – Dataframe of points to split

  • y (pd.Series) – Series of points to split

Returns

Indices to split data into training and test set

Return type

list