samplers
===========================================================

.. py:module:: evalml.pipelines.components.transformers.samplers

.. autoapi-nested-parse::

   Sampler components.


Submodules
----------

.. toctree::
   :titlesonly:
   :maxdepth: 1

   base_sampler/index.rst
   oversampler/index.rst
   undersampler/index.rst


Package Contents
----------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.pipelines.components.transformers.samplers.Oversampler
   evalml.pipelines.components.transformers.samplers.Undersampler


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: Oversampler(sampling_ratio=0.25, sampling_ratio_dict=None, k_neighbors_default=5, n_jobs=-1, random_seed=0, **kwargs)

   SMOTE Oversampler component. Will automatically select whether to use SMOTE, SMOTEN, or SMOTENC based on inputs to the component.

   :param sampling_ratio: The goal ratio of the minority to majority class, in the range (0, 1]. A value of 0.25 means we want a 1:4 ratio of the minority to majority class after oversampling. We create a sampling dictionary using this ratio, with the keys corresponding to the classes and the values corresponding to the number of samples. Defaults to 0.25.
   :type sampling_ratio: float
   :param sampling_ratio_dict: A dictionary specifying the desired balanced ratio for each target value. For instance, in a binary case where class 1 is the minority, we could specify: `sampling_ratio_dict={0: 0.5, 1: 1}`, which means we would undersample class 0 to have twice the number of samples as class 1 (minority:majority ratio = 0.5), and would not sample class 1. Overrides sampling_ratio if provided. Defaults to None.
   :type sampling_ratio_dict: dict
   :param k_neighbors_default: The number of nearest neighbors used to construct synthetic samples. This is the default value used, but the actual k_neighbors value might be smaller if there are fewer samples. Defaults to 5.
   :type k_neighbors_default: int
   :param n_jobs: The number of CPU cores to use. Defaults to -1.
   :type n_jobs: int
   :param random_seed: The seed to use for random sampling. Defaults to 0.
   :type random_seed: int

   **Attributes**

   .. list-table::
      :widths: 15 85
      :header-rows: 0

      * - **hyperparameter_ranges**
        - None
      * - **modifies_features**
        - True
      * - **modifies_target**
        - True
      * - **name**
        - Oversampler
      * - **training_only**
        - True

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.pipelines.components.transformers.samplers.Oversampler.clone
      evalml.pipelines.components.transformers.samplers.Oversampler.default_parameters
      evalml.pipelines.components.transformers.samplers.Oversampler.describe
      evalml.pipelines.components.transformers.samplers.Oversampler.fit
      evalml.pipelines.components.transformers.samplers.Oversampler.fit_transform
      evalml.pipelines.components.transformers.samplers.Oversampler.load
      evalml.pipelines.components.transformers.samplers.Oversampler.needs_fitting
      evalml.pipelines.components.transformers.samplers.Oversampler.parameters
      evalml.pipelines.components.transformers.samplers.Oversampler.save
      evalml.pipelines.components.transformers.samplers.Oversampler.transform
      evalml.pipelines.components.transformers.samplers.Oversampler.update_parameters

   .. py:method:: clone(self)

      Constructs a new component with the same parameters and random state.

      :returns: A new instance of this component with identical parameters and random state.

   .. py:method:: default_parameters(cls)

      Returns the default parameters for this component.

      Our convention is that Component.default_parameters == Component().parameters.

      :returns: Default parameters for this component.
      :rtype: dict

   .. py:method:: describe(self, print_name=False, return_dict=False)

      Describe a component and its parameters.

      :param print_name: Whether to print the name of the component.
      :type print_name: bool, optional
      :param return_dict: Whether to return the description as a dictionary in the format {"name": name, "parameters": parameters}.
      :type return_dict: bool, optional

      :returns: Returns dictionary if return_dict is True, else None.
      :rtype: None or dict

   .. py:method:: fit(self, X, y)

      Fits oversampler to data.

      :param X: The input training data of shape [n_samples, n_features].
      :type X: pd.DataFrame
      :param y: The target training data of length [n_samples].
      :type y: pd.Series, optional

      :returns: self

   .. py:method:: fit_transform(self, X, y)

      Fit and transform data using the sampler component.

      :param X: The input training data of shape [n_samples, n_features].
      :type X: pd.DataFrame
      :param y: The target training data of length [n_samples].
      :type y: pd.Series, optional

      :returns: Transformed data.
      :rtype: (pd.DataFrame, pd.Series)

   .. py:method:: load(file_path)
      :staticmethod:

      Loads component at file path.

      :param file_path: Location to load file.
      :type file_path: str

      :returns: ComponentBase object

   .. py:method:: needs_fitting(self)

      Returns a boolean indicating whether the component needs fitting before calling predict, predict_proba, transform, or feature_importances.

      This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

      :returns: True.

   .. py:method:: parameters(self)
      :property:

      Returns the parameters which were used to initialize the component.

   .. py:method:: save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

      Saves component at file path.

      :param file_path: Location to save file.
      :type file_path: str
      :param pickle_protocol: The pickle data stream format.
      :type pickle_protocol: int

   .. py:method:: transform(self, X, y=None)

      Transforms the input data by oversampling the data.

      :param X: Training features.
      :type X: pd.DataFrame
      :param y: Target.
      :type y: pd.Series

      :returns: Transformed features and target.
      :rtype: pd.DataFrame, pd.Series

   .. py:method:: update_parameters(self, update_dict, reset_fit=True)

      Updates the parameter dictionary of the component.

      :param update_dict: A dict of parameters to update.
      :type update_dict: dict
      :param reset_fit: If True, will set `_is_fitted` to False.
      :type reset_fit: bool, optional


.. py:class:: Undersampler(sampling_ratio=0.25, sampling_ratio_dict=None, min_samples=100, min_percentage=0.1, random_seed=0, **kwargs)

   Initializes an undersampling transformer to downsample the majority classes in the dataset.

   This component is only run during training and not during predict.

   :param sampling_ratio: The smallest minority:majority ratio that is accepted as 'balanced'. For instance, a 1:4 ratio would be represented as 0.25, while a 1:1 ratio is 1.0. Must be in the range (0, 1]. Defaults to 0.25.
   :type sampling_ratio: float
   :param sampling_ratio_dict: A dictionary specifying the desired balanced ratio for each target value. For instance, in a binary case where class 1 is the minority, we could specify: `sampling_ratio_dict={0: 0.5, 1: 1}`, which means we would undersample class 0 to have twice the number of samples as class 1 (minority:majority ratio = 0.5), and would not sample class 1. Overrides sampling_ratio if provided. Defaults to None.
   :type sampling_ratio_dict: dict
   :param min_samples: The minimum number of samples that we must have for any class, pre or post sampling. If a class must be downsampled, it will not be downsampled past this value. To determine severe imbalance, the minority class must occur less often than this and must have a class ratio below min_percentage. Must be greater than 0. Defaults to 100.
   :type min_samples: int
   :param min_percentage: The minimum percentage of the minority class to the total dataset that we tolerate, as long as it is above min_samples. If neither min_percentage nor min_samples is met, the data is treated as severely imbalanced and is not resampled. Must be between 0 and 0.5, inclusive. Defaults to 0.1.
   :type min_percentage: float
   :param random_seed: The seed to use for random sampling. Defaults to 0.
   :type random_seed: int

   :raises ValueError: If sampling_ratio is not in the range (0, 1].
   :raises ValueError: If min_samples is not greater than 0.
   :raises ValueError: If min_percentage is not between 0 and 0.5, inclusive.

   **Attributes**

   .. list-table::
      :widths: 15 85
      :header-rows: 0

      * - **hyperparameter_ranges**
        - {}
      * - **modifies_features**
        - True
      * - **modifies_target**
        - True
      * - **name**
        - Undersampler
      * - **training_only**
        - True

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.pipelines.components.transformers.samplers.Undersampler.clone
      evalml.pipelines.components.transformers.samplers.Undersampler.default_parameters
      evalml.pipelines.components.transformers.samplers.Undersampler.describe
      evalml.pipelines.components.transformers.samplers.Undersampler.fit
      evalml.pipelines.components.transformers.samplers.Undersampler.fit_resample
      evalml.pipelines.components.transformers.samplers.Undersampler.fit_transform
      evalml.pipelines.components.transformers.samplers.Undersampler.load
      evalml.pipelines.components.transformers.samplers.Undersampler.needs_fitting
      evalml.pipelines.components.transformers.samplers.Undersampler.parameters
      evalml.pipelines.components.transformers.samplers.Undersampler.save
      evalml.pipelines.components.transformers.samplers.Undersampler.transform
      evalml.pipelines.components.transformers.samplers.Undersampler.update_parameters

   .. py:method:: clone(self)

      Constructs a new component with the same parameters and random state.

      :returns: A new instance of this component with identical parameters and random state.

   .. py:method:: default_parameters(cls)

      Returns the default parameters for this component.

      Our convention is that Component.default_parameters == Component().parameters.

      :returns: Default parameters for this component.
      :rtype: dict

   .. py:method:: describe(self, print_name=False, return_dict=False)

      Describe a component and its parameters.

      :param print_name: Whether to print the name of the component.
      :type print_name: bool, optional
      :param return_dict: Whether to return the description as a dictionary in the format {"name": name, "parameters": parameters}.
      :type return_dict: bool, optional

      :returns: Returns dictionary if return_dict is True, else None.
      :rtype: None or dict

   .. py:method:: fit(self, X, y)

      Fits the sampler to the data.

      :param X: Input features.
      :type X: pd.DataFrame
      :param y: Target.
      :type y: pd.Series

      :returns: self

      :raises ValueError: If y is None.

   .. py:method:: fit_resample(self, X, y)

      Resampling technique for this sampler.

      :param X: Training data to fit and resample.
      :type X: pd.DataFrame
      :param y: Training data targets to fit and resample.
      :type y: pd.Series

      :returns: Indices to keep for training data.
      :rtype: list

   .. py:method:: fit_transform(self, X, y)

      Fit and transform data using the sampler component.

      :param X: The input training data of shape [n_samples, n_features].
      :type X: pd.DataFrame
      :param y: The target training data of length [n_samples].
      :type y: pd.Series, optional

      :returns: Transformed data.
      :rtype: (pd.DataFrame, pd.Series)

   .. py:method:: load(file_path)
      :staticmethod:

      Loads component at file path.

      :param file_path: Location to load file.
      :type file_path: str

      :returns: ComponentBase object

   .. py:method:: needs_fitting(self)

      Returns a boolean indicating whether the component needs fitting before calling predict, predict_proba, transform, or feature_importances.

      This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

      :returns: True.

   .. py:method:: parameters(self)
      :property:

      Returns the parameters which were used to initialize the component.

   .. py:method:: save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

      Saves component at file path.

      :param file_path: Location to save file.
      :type file_path: str
      :param pickle_protocol: The pickle data stream format.
      :type pickle_protocol: int

   .. py:method:: transform(self, X, y=None)

      Transforms the input data by sampling the data.

      :param X: Training features.
      :type X: pd.DataFrame
      :param y: Target.
      :type y: pd.Series

      :returns: Transformed features and target.
      :rtype: pd.DataFrame, pd.Series

   .. py:method:: update_parameters(self, update_dict, reset_fit=True)

      Updates the parameter dictionary of the component.

      :param update_dict: A dict of parameters to update.
      :type update_dict: dict
      :param reset_fit: If True, will set `_is_fitted` to False.
      :type reset_fit: bool, optional
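
The ``sampling_ratio`` arithmetic documented for both samplers can be illustrated with a small standalone sketch. It does not call evalml and is not the library's implementation: the helper names ``oversampling_targets`` and ``undersampling_targets`` are hypothetical, truncating to an integer is an assumed rounding rule, and ``sampling_ratio_dict`` and the ``min_percentage`` severe-imbalance check are left out.

```python
from collections import Counter


def _check_ratio(sampling_ratio):
    # Both samplers document sampling_ratio as lying in the range (0, 1].
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in the range (0, 1]")


def oversampling_targets(y, sampling_ratio=0.25):
    """Hypothetical helper: per-class sample counts after oversampling.

    Classes smaller than sampling_ratio * (majority count) are raised to
    that goal; larger classes keep their current count.
    """
    _check_ratio(sampling_ratio)
    counts = Counter(y)
    majority = max(counts.values())
    goal = int(majority * sampling_ratio)  # assumed rounding: truncate
    return {cls: max(n, goal) for cls, n in counts.items()}


def undersampling_targets(y, sampling_ratio=0.25, min_samples=100):
    """Hypothetical helper: per-class sample counts after undersampling.

    Classes larger than (minority count) / sampling_ratio are reduced to
    that cap, but never below min_samples. Classes are never enlarged.
    """
    _check_ratio(sampling_ratio)
    if min_samples <= 0:
        raise ValueError("min_samples must be greater than 0")
    counts = Counter(y)
    minority = min(counts.values())
    cap = int(minority / sampling_ratio)  # assumed rounding: truncate
    floor = max(cap, min_samples)
    return {cls: min(n, floor) for cls, n in counts.items()}


# 400 majority samples (class 0) and 50 minority samples (class 1).
y = [0] * 400 + [1] * 50

print(oversampling_targets(y, sampling_ratio=0.25))   # {0: 400, 1: 100}
print(undersampling_targets(y, sampling_ratio=0.25))  # {0: 200, 1: 50}
print(undersampling_targets(y, sampling_ratio=1.0))   # {0: 100, 1: 50}
```

With a ratio of 0.25, oversampling raises the minority class to a quarter of the majority count, while undersampling cuts the majority class to four times the minority count. In the last call a 1:1 ratio would ask for 50 majority samples, but the ``min_samples`` floor of 100 stops the downsampling there.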