data_splitters
Submodules
Package Contents
Classes Summary
BalancedClassificationSampler | Class for balanced classification downsampler.
SamplerBase | Base class for all custom samplers.
TimeSeriesSplit | Rolling Origin Cross Validation for time series problems.
TrainingValidationSplit | Split the training data into training and validation sets.
Contents
class evalml.preprocessing.data_splitters.BalancedClassificationSampler(sampling_ratio=0.25, sampling_ratio_dict=None, min_samples=100, min_percentage=0.1, random_seed=0)
Class for balanced classification downsampler.
- Parameters
sampling_ratio (float) – The smallest minority:majority ratio that is accepted as ‘balanced’. For instance, a 1:4 ratio would be represented as 0.25, while a 1:1 ratio is 1.0. Must be between 0 and 1, inclusive. Defaults to 0.25.
sampling_ratio_dict (dict) – A dictionary specifying the desired balanced ratio for each target value. Overrides sampling_ratio if provided. Defaults to None.
min_samples (int) – The minimum number of samples that we must have for any class, pre or post sampling. If a class must be downsampled, it will not be downsampled past this value. To determine severe imbalance, the minority class must occur less often than this and must have a class ratio below min_percentage. Must be greater than 0. Defaults to 100.
min_percentage (float) – The smallest fraction of the total dataset that the minority class may make up and still be tolerated, as long as that class has more than min_samples samples. To determine severe imbalance, the minority class must have a class ratio below this and must occur less often than min_samples. Must be between 0 and 0.5, inclusive. Defaults to 0.1.
random_seed (int) – The seed to use for random sampling. Defaults to 0.
Methods
fit_resample | Resampling technique for this sampler.
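As a rough illustration of how sampling_ratio and min_samples interact, the sketch below computes per-class target counts after downsampling. It is not evalml's implementation: the function name balanced_target_counts is hypothetical, it works on a plain list of labels, and it ignores sampling_ratio_dict and the min_percentage check.

```python
from collections import Counter


def balanced_target_counts(y, sampling_ratio=0.25, min_samples=100):
    """Hypothetical helper: per-class sample counts after downsampling."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("this sketch needs 0 < sampling_ratio <= 1")
    counts = Counter(y)
    minority = min(counts.values())
    # Largest class size that still satisfies minority:class >= sampling_ratio.
    cap = int(minority / sampling_ratio)
    targets = {}
    for label, n in counts.items():
        if n <= cap:
            # Already within the accepted ratio: leave the class untouched.
            targets[label] = n
        else:
            # Downsample to the cap, but never below min_samples
            # (and never above the class's original size).
            targets[label] = max(cap, min(n, min_samples))
    return targets


labels = ["a"] * 1000 + ["b"] * 200
print(balanced_target_counts(labels))  # {'a': 800, 'b': 200}
```

With a 1000:200 split and sampling_ratio=0.25, the majority class is reduced to 800 so that the minority:majority ratio reaches 1:4. With a 1000:10 split, the min_samples floor of 100 wins over the ratio cap of 40.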
class evalml.preprocessing.data_splitters.SamplerBase(random_seed=0)
Base class for all custom samplers.
- Parameters
random_seed (int) – The seed to use for random sampling. Defaults to 0.
Methods
fit_resample | Resample the input data with this sampling strategy.
abstract fit_resample(self, X, y)
Resample the input data with this sampling strategy.
- Parameters
X (pd.DataFrame) – Training data to fit and resample.
y (pd.Series) – Training data targets to fit and resample.
- Returns
Resampled X and y data for oversampling, or indices to keep for undersampling.
- Return type
Tuple(pd.DataFrame, pd.Series) or list
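The abstract fit_resample contract can be pictured with a minimal sketch. The classes below are hypothetical stand-ins written with the standard library, not evalml's SamplerBase; they use plain lists instead of pandas objects and show only the undersampling case, where a list of indices to keep is returned.

```python
import random
from abc import ABC, abstractmethod


class SketchSamplerBase(ABC):
    """Hypothetical stand-in for a sampler base class."""

    def __init__(self, random_seed=0):
        self.random_seed = random_seed

    @abstractmethod
    def fit_resample(self, X, y):
        """Return resampled data, or the indices to keep when undersampling."""


class KeepHalfSampler(SketchSamplerBase):
    """Toy undersampler: keeps a reproducible random half of the rows."""

    def fit_resample(self, X, y):
        rng = random.Random(self.random_seed)
        n_rows = len(y)
        # Sample row positions without replacement, then restore row order.
        return sorted(rng.sample(range(n_rows), n_rows // 2))


X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
kept = KeepHalfSampler(random_seed=0).fit_resample(X, y)
print(kept)
```

Because fit_resample is abstract, the base class cannot be instantiated directly; each concrete sampler supplies its own strategy, and the same random_seed reproduces the same selection.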
class evalml.preprocessing.data_splitters.TimeSeriesSplit(max_delay=0, gap=0, date_index=None, n_splits=3)
Rolling Origin Cross Validation for time series problems.
This class uses the max_delay and gap values to account for the feature and target engineering that evalml time series pipelines perform, e.g. delaying input features and shifting the target variable by the desired amount. If the data to be split already has all the features and appropriate target values, set max_delay and gap to 0.
- Parameters
max_delay (int) – Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid “throwing out” more data than is necessary. Defaults to 0.
gap (int) – Gap used in time series problem. Time series pipelines shift the target variable by gap rows. Defaults to 0.
date_index (str) – Name of the column containing the datetime information used to order the data. Defaults to None.
n_splits (int) – Number of data splits to make. Defaults to 3.
Methods
get_n_splits | Get the number of data splits.
split | Get the time series splits.
split(self, X, y=None, groups=None)
Get the time series splits.
X and y are assumed to be sorted in ascending time order. Either X or y may be empty or None, but not both at the same time.
- Parameters
X (pd.DataFrame, None) – Features to split.
y (pd.DataFrame, None) – Target variable to split.
groups – Ignored but kept for compatibility with the sklearn API.
- Returns
Iterator of (train, test) indices tuples.
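The rolling-origin idea and the role of max_delay can be sketched in plain Python. This is an illustration under stated assumptions, not evalml's code: the function name rolling_origin_splits is hypothetical, the fold sizing (equal-sized test blocks at the end of the data) is an assumption, and gap is left out because the text above does not say how the splitter applies it.

```python
def rolling_origin_splits(n_rows, n_splits=3, max_delay=0):
    """Hypothetical helper: yield (train, test) index lists in time order."""
    test_size = n_rows // (n_splits + 1)
    if test_size < 1:
        raise ValueError("not enough rows for the requested number of splits")
    first_start = n_rows - n_splits * test_size
    for start in range(first_start, n_rows, test_size):
        # Training data is everything before the origin (expanding window).
        train = list(range(0, start))
        # Reuse the last max_delay training rows at the front of the test
        # block, so delayed features have values instead of NaNs.
        test = list(range(max(0, start - max_delay), start + test_size))
        yield train, test


for train, test in rolling_origin_splits(12, n_splits=3, max_delay=1):
    print(train, test)
```

With 12 rows, 3 splits, and max_delay=1, the first fold trains on rows 0 to 2 and tests on rows 2 to 5: row 2 is carried over so that the one-step-delayed feature for row 3 is defined.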
class evalml.preprocessing.data_splitters.TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)
Split the training data into training and validation sets.
- Parameters
test_size (float) – What percentage of data points should be included in the validation set. Defaults to the complement of train_size if train_size is set, and 0.25 otherwise.
train_size (float) – What percentage of data points should be included in the training set. Defaults to the complement of test_size.
shuffle (boolean) – Whether to shuffle the data before splitting. Defaults to False.
stratify (list) – Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.
random_seed (int) – The seed to use for random sampling. Defaults to 0.
Methods
get_n_splits | Returns the number of splits of this object.
split | Divides the data into training and testing sets.
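The way test_size and train_size default to each other's complement can be sketched as follows. The helper training_validation_indices is hypothetical and is not evalml's implementation; the rounding (ceiling for the validation set, floor for the training set) is an assumption, and stratify is not modeled.

```python
import math
import random


def training_validation_indices(n_rows, test_size=None, train_size=None,
                                shuffle=False, random_seed=0):
    """Hypothetical helper: one (train, validation) pair of index lists."""
    # Resolve defaults as described in the parameter list above.
    if test_size is None:
        test_size = 1 - train_size if train_size is not None else 0.25
    if train_size is None:
        train_size = 1 - test_size
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor(train_size * n_rows)
    indices = list(range(n_rows))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible.
        random.Random(random_seed).shuffle(indices)
    return indices[:n_train], indices[n_train:n_train + n_test]


train, validation = training_validation_indices(10)
print(train, validation)
```

With the defaults and 10 rows, the first 7 rows form the training set and the last 3 the validation set; passing shuffle=True permutes the rows first, using random_seed for reproducibility.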