data_splitters
=============================================

.. py:module:: evalml.preprocessing.data_splitters

.. autoapi-nested-parse::

   Data splitter classes.


Submodules
----------

.. toctree::
   :titlesonly:
   :maxdepth: 1

   no_split/index.rst
   sk_splitters/index.rst
   time_series_split/index.rst
   training_validation_split/index.rst


Package Contents
----------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.preprocessing.data_splitters.KFold
   evalml.preprocessing.data_splitters.NoSplit
   evalml.preprocessing.data_splitters.StratifiedKFold
   evalml.preprocessing.data_splitters.TimeSeriesSplit
   evalml.preprocessing.data_splitters.TrainingValidationSplit


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: KFold(n_splits=5, *, shuffle=False, random_state=None)

   Wrapper class for sklearn's KFold splitter.
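
   .. rubric:: Example

   A minimal usage sketch, assuming this wrapper preserves the split behavior of sklearn's unshuffled ``KFold``:

   >>> import numpy as np
   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import KFold
   ...
   >>> X = pd.DataFrame([i for i in range(6)], columns=["First"])
   >>> y = pd.Series([i for i in range(6)])
   ...
   >>> kfold = KFold(n_splits=3)
   >>> splits = list(kfold.split(X, y))
   >>> assert len(splits) == 3
   >>> train, test = splits[0]
   >>> assert (test == np.array([0, 1])).all()         # first fold holds out the first two rows
   >>> assert (train == np.array([2, 3, 4, 5])).all()  # the remaining rows form the training set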

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.KFold.get_metadata_routing
      evalml.preprocessing.data_splitters.KFold.get_n_splits
      evalml.preprocessing.data_splitters.KFold.is_cv
      evalml.preprocessing.data_splitters.KFold.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest

   .. py:method:: get_n_splits(self, X=None, y=None, groups=None)

      Returns the number of splitting iterations in the cross-validator.

      :param X: Always ignored, exists for compatibility.
      :type X: object
      :param y: Always ignored, exists for compatibility.
      :type y: object
      :param groups: Always ignored, exists for compatibility.
      :type groups: object

      :returns: **n_splits** -- Returns the number of splitting iterations in the cross-validator.
      :rtype: int

   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool

   .. py:method:: split(self, X, y=None, groups=None)

      Generate indices to split data into training and test set.

      :param X: Training data, where `n_samples` is the number of samples and `n_features` is the number of features.
      :type X: array-like of shape (n_samples, n_features)
      :param y: The target variable for supervised learning problems.
      :type y: array-like of shape (n_samples,), default=None
      :param groups: Group labels for the samples used while splitting the dataset into train/test set.
      :type groups: array-like of shape (n_samples,), default=None

      :Yields: * **train** (*ndarray*) -- The training set indices for that split.
               * **test** (*ndarray*) -- The testing set indices for that split.


.. py:class:: NoSplit(random_seed=0)

   Does not split the training data into training and validation sets.

   All data is passed as the training set; the test data is simply an array of `None`. Intended for future unsupervised learning, and should not be used in any of the currently supported pipelines.

   :param random_seed: The seed to use for random sampling. Defaults to 0. Not used.
   :type random_seed: int
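
   .. rubric:: Example

   A minimal sketch based on the behavior described above (every row is kept for training and nothing is held out); the exact container type of the empty test side is an implementation detail:

   >>> import numpy as np
   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import NoSplit
   ...
   >>> X = pd.DataFrame([i for i in range(4)], columns=["First"])
   >>> y = pd.Series([i for i in range(4)])
   ...
   >>> no_split = NoSplit()
   >>> for train, test in no_split.split(X, y):
   ...     assert (train == np.arange(4)).all()  # every row stays in the training set
   ...     assert len(test) == 0                 # nothing is held out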

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.NoSplit.get_metadata_routing
      evalml.preprocessing.data_splitters.NoSplit.get_n_splits
      evalml.preprocessing.data_splitters.NoSplit.is_cv
      evalml.preprocessing.data_splitters.NoSplit.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest

   .. py:method:: get_n_splits()
      :staticmethod:

      Return the number of splits of this object.

      :returns: Always returns 0.
      :rtype: int

   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool

   .. py:method:: split(self, X, y=None)

      Divide the data into training and testing sets, where the testing set is empty.

      :param X: Dataframe of points to split
      :type X: pd.DataFrame
      :param y: Series of points to split
      :type y: pd.Series

      :returns: Indices to split data into training and test set
      :rtype: list


.. py:class:: StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)

   Wrapper class for sklearn's Stratified KFold splitter.
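
   .. rubric:: Example

   A minimal usage sketch, assuming this wrapper preserves the split behavior of sklearn's unshuffled ``StratifiedKFold``; each test fold draws one sample from each class:

   >>> import numpy as np
   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import StratifiedKFold
   ...
   >>> X = pd.DataFrame([i for i in range(6)], columns=["First"])
   >>> y = pd.Series([0, 0, 0, 1, 1, 1])
   ...
   >>> skf = StratifiedKFold(n_splits=3)
   >>> for train, test in skf.split(X, y):
   ...     assert len(test) == 2                             # 6 samples spread across 3 folds
   ...     assert len(np.intersect1d(test, [0, 1, 2])) == 1  # one sample of class 0
   ...     assert len(np.intersect1d(test, [3, 4, 5])) == 1  # one sample of class 1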

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.StratifiedKFold.get_metadata_routing
      evalml.preprocessing.data_splitters.StratifiedKFold.get_n_splits
      evalml.preprocessing.data_splitters.StratifiedKFold.is_cv
      evalml.preprocessing.data_splitters.StratifiedKFold.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest

   .. py:method:: get_n_splits(self, X=None, y=None, groups=None)

      Returns the number of splitting iterations in the cross-validator.

      :param X: Always ignored, exists for compatibility.
      :type X: object
      :param y: Always ignored, exists for compatibility.
      :type y: object
      :param groups: Always ignored, exists for compatibility.
      :type groups: object

      :returns: **n_splits** -- Returns the number of splitting iterations in the cross-validator.
      :rtype: int

   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool

   .. py:method:: split(self, X, y, groups=None)

      Generate indices to split data into training and test set.

      :param X: Training data, where `n_samples` is the number of samples and `n_features` is the number of features.
                Note that providing ``y`` is sufficient to generate the splits and hence ``np.zeros(n_samples)`` may be used as a placeholder for ``X`` instead of actual training data.
      :type X: array-like of shape (n_samples, n_features)
      :param y: The target variable for supervised learning problems. Stratification is done based on the y labels.
      :type y: array-like of shape (n_samples,)
      :param groups: Always ignored, exists for compatibility.
      :type groups: object

      :Yields: * **train** (*ndarray*) -- The training set indices for that split.
               * **test** (*ndarray*) -- The testing set indices for that split.

      .. rubric:: Notes

      Randomized CV splitters may return different results for each call of split. You can make the results identical by setting `random_state` to an integer.


.. py:class:: TimeSeriesSplit(max_delay=0, gap=0, forecast_horizon=None, time_index=None, n_series=None, n_splits=3)

   Rolling Origin Cross Validation for time series problems.

   The max_delay, gap, and forecast_horizon parameters are only used to validate that the requested split size is not too small given these parameters.

   :param max_delay: Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid "throwing out" more data than is necessary. Defaults to 0.
   :type max_delay: int
   :param gap: Number of time units separating the data used to generate features and the data to forecast on. Defaults to 0.
   :type gap: int
   :param forecast_horizon: Number of time units to forecast. Used for parameter validation. If an integer, it will set the size of the cv splits. Defaults to None.
   :type forecast_horizon: int, None
   :param time_index: Name of the column containing the datetime information used to order the data. Defaults to None.
   :type time_index: str
   :param n_splits: Number of data splits to make. Defaults to 3.
   :type n_splits: int

   .. rubric:: Example

   >>> import numpy as np
   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import TimeSeriesSplit
   ...
   >>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
   >>> y = pd.Series([i for i in range(10)])
   ...
   >>> ts_split = TimeSeriesSplit(n_splits=4)
   >>> generator_ = ts_split.split(X, y)
   ...
   >>> first_split = next(generator_)
   >>> assert (first_split[0] == np.array([0, 1])).all()
   >>> assert (first_split[1] == np.array([2, 3])).all()
   ...
   ...
   >>> second_split = next(generator_)
   >>> assert (second_split[0] == np.array([0, 1, 2, 3])).all()
   >>> assert (second_split[1] == np.array([4, 5])).all()
   ...
   ...
   >>> third_split = next(generator_)
   >>> assert (third_split[0] == np.array([0, 1, 2, 3, 4, 5])).all()
   >>> assert (third_split[1] == np.array([6, 7])).all()
   ...
   ...
   >>> fourth_split = next(generator_)
   >>> assert (fourth_split[0] == np.array([0, 1, 2, 3, 4, 5, 6, 7])).all()
   >>> assert (fourth_split[1] == np.array([8, 9])).all()

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.TimeSeriesSplit.get_metadata_routing
      evalml.preprocessing.data_splitters.TimeSeriesSplit.get_n_splits
      evalml.preprocessing.data_splitters.TimeSeriesSplit.is_cv
      evalml.preprocessing.data_splitters.TimeSeriesSplit.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest

   .. py:method:: get_n_splits(self, X=None, y=None, groups=None)

      Get the number of data splits.

      :param X: Features to split.
      :type X: pd.DataFrame, None
      :param y: Target variable to split. Defaults to None.
      :type y: pd.DataFrame, None
      :param groups: Ignored but kept for compatibility with sklearn API. Defaults to None.

      :returns: Number of splits.

   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool

   .. py:method:: split(self, X, y=None, groups=None)

      Get the time series splits.

      X and y are assumed to be sorted in ascending time order. This method can handle passing in empty or None X and y data, but note that X and y cannot both be None or empty at the same time.

      :param X: Features to split.
      :type X: pd.DataFrame, None
      :param y: Target variable to split. Defaults to None.
      :type y: pd.DataFrame, None
      :param groups: Ignored but kept for compatibility with sklearn API. Defaults to None.

      :Yields: Iterator of (train, test) indices tuples.

      :raises ValueError: If one of the proposed splits would be empty.


.. py:class:: TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)

   Split the training data into training and validation sets.

   :param test_size: What percentage of data points should be included in the validation set. Defaults to the complement of `train_size` if `train_size` is set, and 0.25 otherwise.
   :type test_size: float
   :param train_size: What percentage of data points should be included in the training set. Defaults to the complement of `test_size`.
   :type train_size: float
   :param shuffle: Whether to shuffle the data before splitting. Defaults to False.
   :type shuffle: boolean
   :param stratify: Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.
   :type stratify: list
   :param random_seed: The seed to use for random sampling. Defaults to 0.
   :type random_seed: int

   .. rubric:: Examples

   >>> import numpy as np
   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import TrainingValidationSplit
   ...
   >>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
   >>> y = pd.Series([i for i in range(10)])
   ...
   >>> tv_split = TrainingValidationSplit()
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([0, 1, 2, 3, 4, 5, 6])).all()
   >>> assert (split_[1] == np.array([7, 8, 9])).all()
   ...
   ...
   >>> tv_split = TrainingValidationSplit(test_size=0.5)
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([0, 1, 2, 3, 4])).all()
   >>> assert (split_[1] == np.array([5, 6, 7, 8, 9])).all()
   ...
   ...
   >>> tv_split = TrainingValidationSplit(shuffle=True)
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([9, 1, 6, 7, 3, 0, 5])).all()
   >>> assert (split_[1] == np.array([2, 8, 4])).all()
   ...
   ...
   >>> y = pd.Series([i % 3 for i in range(10)])
   >>> tv_split = TrainingValidationSplit(shuffle=True, stratify=y)
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([1, 9, 3, 2, 8, 6, 7])).all()
   >>> assert (split_[1] == np.array([0, 4, 5])).all()

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.TrainingValidationSplit.get_metadata_routing
      evalml.preprocessing.data_splitters.TrainingValidationSplit.get_n_splits
      evalml.preprocessing.data_splitters.TrainingValidationSplit.is_cv
      evalml.preprocessing.data_splitters.TrainingValidationSplit.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest

   .. py:method:: get_n_splits()
      :staticmethod:

      Return the number of splits of this object.

      :returns: Always returns 1.
      :rtype: int

   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool

   .. py:method:: split(self, X, y=None)

      Divide the data into training and testing sets.

      :param X: Dataframe of points to split
      :type X: pd.DataFrame
      :param y: Series of points to split
      :type y: pd.Series

      :returns: Indices to split data into training and test set
      :rtype: list