balanced_classification_sampler

Module Contents

Classes Summary

BalancedClassificationSampler

Class for balanced classification downsampler.

Contents

class evalml.preprocessing.data_splitters.balanced_classification_sampler.BalancedClassificationSampler(sampling_ratio=0.25, sampling_ratio_dict=None, min_samples=100, min_percentage=0.1, random_seed=0)

Downsampler for classification problems that reduces over-represented classes until the target meets the configured balance ratio.

Parameters
  • sampling_ratio (float) – The smallest minority:majority ratio that is accepted as ‘balanced’. For instance, a 1:4 ratio would be represented as 0.25, while a 1:1 ratio is 1.0. Must be between 0 and 1, inclusive. Defaults to 0.25.

  • sampling_ratio_dict (dict) – A dictionary specifying the desired balanced ratio for each target value. Overrides sampling_ratio if provided. Defaults to None.

  • min_samples (int) – The minimum number of samples that any class must have, before or after sampling. A class that needs downsampling is never reduced below this value. The dataset is treated as severely imbalanced only when the minority class has fewer samples than this value and its class ratio is also below min_percentage. Must be greater than 0. Defaults to 100.

  • min_percentage (float) – The smallest share of the total dataset that the minority class may make up and still be tolerated, provided the class has more than min_samples samples. The dataset is treated as severely imbalanced only when the minority class ratio is below this value and the class also has fewer than min_samples samples. Expressed as a fraction rather than a percent: must be between 0 and 0.5, inclusive. Defaults to 0.1.

  • random_seed (int) – The seed to use for random sampling. Defaults to 0.
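
The two threshold checks described by the parameters above can be sketched in plain Python. This is an illustration of one reading of the documented rules, not evalml's implementation; the helper names `meets_sampling_ratio` and `is_severely_imbalanced` and the `counts` dictionary are hypothetical.

```python
def meets_sampling_ratio(counts, sampling_ratio=0.25):
    """Return True if minority:majority is at least sampling_ratio (hypothetical helper)."""
    minority = min(counts.values())
    majority = max(counts.values())
    return minority / majority >= sampling_ratio


def is_severely_imbalanced(counts, min_samples=100, min_percentage=0.1):
    """Return True only if BOTH documented conditions hold (hypothetical helper)."""
    minority = min(counts.values())
    total = sum(counts.values())
    # Minority has fewer rows than min_samples AND a smaller share than min_percentage.
    return minority < min_samples and minority / total < min_percentage


# Assumed example class counts, keyed by target value.
skewed = {"neg": 1000, "pos": 50}
balanced = {"neg": 400, "pos": 100}

skewed_ok = meets_sampling_ratio(skewed)        # 50 / 1000 = 0.05, below 0.25
balanced_ok = meets_sampling_ratio(balanced)    # 100 / 400 = 0.25, meets 0.25
skewed_severe = is_severely_imbalanced(skewed)  # 50 < 100 and 50 / 1050 < 0.1
```

With the defaults, a 1:4 split is the most lopsided distribution that still counts as balanced, and a class is flagged as severely imbalanced only when it is small in both absolute and relative terms.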

Methods

fit_resample

Resampling technique for this sampler.

fit_resample(self, X, y)

Fits the sampler on the training data and determines which rows to keep after downsampling.

Parameters
  • X (pd.DataFrame) – Training data to fit and resample

  • y (pd.Series) – Training data targets to fit and resample

Returns

Indices to keep for training data

Return type

list
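
Because the method returns indices to keep rather than resampled data, the caller is responsible for subsetting X and y. The sketch below shows that pattern with a stand-in function written in plain Python; `downsample_indices` is hypothetical and is not evalml's code. It assumes positional indices and plain lists, whereas the real method takes a pd.DataFrame and a pd.Series, and this page does not state whether the returned values are positions or index labels.

```python
import random
from collections import Counter


def downsample_indices(y, sampling_ratio=0.25, min_samples=100, random_seed=0):
    """Hypothetical stand-in: return sorted positions of y to keep."""
    counts = Counter(y)
    minority = min(counts.values())
    # Largest class size still considered balanced, never below min_samples.
    cap = max(min_samples, int(minority / sampling_ratio))
    rng = random.Random(random_seed)
    keep = []
    for label, count in counts.items():
        positions = [i for i, value in enumerate(y) if value == label]
        if count > cap:
            # Randomly drop rows from over-represented classes only.
            positions = rng.sample(positions, cap)
        keep.extend(positions)
    return sorted(keep)


# Assumed toy data: 1000 majority rows and 100 minority rows.
y = ["neg"] * 1000 + ["pos"] * 100
X = [[float(i)] for i in range(len(y))]

keep = downsample_indices(y)
X_train = [X[i] for i in keep]
y_train = [y[i] for i in keep]
kept_counts = Counter(y_train)
```

In this sketch the minority class is left untouched and the majority class is cut to 400 rows, which yields the 1:4 ratio implied by the default sampling_ratio of 0.25.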