data_splitters
=============================================

.. py:module:: evalml.preprocessing.data_splitters

.. autoapi-nested-parse::

   Data splitter classes.


Submodules
----------

.. toctree::
   :titlesonly:
   :maxdepth: 1

   no_split/index.rst
   sk_splitters/index.rst
   time_series_split/index.rst
   training_validation_split/index.rst


Package Contents
----------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.preprocessing.data_splitters.KFold
   evalml.preprocessing.data_splitters.NoSplit
   evalml.preprocessing.data_splitters.StratifiedKFold
   evalml.preprocessing.data_splitters.TimeSeriesSplit
   evalml.preprocessing.data_splitters.TrainingValidationSplit


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: KFold(n_splits=5, *, shuffle=False, random_state=None)

   Wrapper class for sklearn's KFold splitter.

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.KFold.get_metadata_routing
      evalml.preprocessing.data_splitters.KFold.get_n_splits
      evalml.preprocessing.data_splitters.KFold.is_cv
      evalml.preprocessing.data_splitters.KFold.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest


   .. py:method:: get_n_splits(self, X=None, y=None, groups=None)

      Returns the number of splitting iterations in the cross-validator.

      :param X: Always ignored, exists for compatibility.
      :type X: object
      :param y: Always ignored, exists for compatibility.
      :type y: object
      :param groups: Always ignored, exists for compatibility.
      :type groups: object

      :returns: **n_splits** -- Returns the number of splitting iterations in the cross-validator.
      :rtype: int


   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool


   .. py:method:: split(self, X, y=None, groups=None)

      Generate indices to split data into training and test set.

      :param X: Training data, where `n_samples` is the number of samples and `n_features` is the number of features.
      :type X: array-like of shape (n_samples, n_features)
      :param y: The target variable for supervised learning problems.
      :type y: array-like of shape (n_samples,), default=None
      :param groups: Group labels for the samples used while splitting the dataset into train/test set.
      :type groups: array-like of shape (n_samples,), default=None

      :Yields: * **train** (*ndarray*) -- The training set indices for that split.
               * **test** (*ndarray*) -- The testing set indices for that split.
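   .. rubric:: Example

   A minimal usage sketch, assuming this wrapper exposes the same ``split`` interface as sklearn's ``KFold``; the data below is illustrative only.

   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import KFold
   ...
   >>> X = pd.DataFrame({"First": range(6)})
   >>> kf = KFold(n_splits=3)
   >>> # Each of the 3 folds holds out 2 of the 6 rows as the test set.
   >>> for train, test in kf.split(X):
   ...     assert len(train) == 4
   ...     assert len(test) == 2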
.. py:class:: NoSplit(random_seed=0)

   Does not split the training data into training and validation sets. All data is passed as the training set; test data is simply an array of `None`. To be used for future unsupervised learning; should not be used in any of the currently supported pipelines.

   :param random_seed: The seed to use for random sampling. Defaults to 0. Not used.
   :type random_seed: int

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.NoSplit.get_metadata_routing
      evalml.preprocessing.data_splitters.NoSplit.get_n_splits
      evalml.preprocessing.data_splitters.NoSplit.is_cv
      evalml.preprocessing.data_splitters.NoSplit.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest


   .. py:method:: get_n_splits()
      :staticmethod:

      Return the number of splits of this object.

      :returns: Always returns 0.
      :rtype: int


   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool


   .. py:method:: split(self, X, y=None)

      Divide the data into training and testing sets, where the testing set is empty.

      :param X: Dataframe of points to split.
      :type X: pd.DataFrame
      :param y: Series of points to split.
      :type y: pd.Series

      :returns: Indices to split data into training and test set.
      :rtype: list
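   .. rubric:: Example

   A minimal usage sketch based on the behavior described above: every row lands in the training set, so the training indices always cover the full dataset.

   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import NoSplit
   ...
   >>> X = pd.DataFrame({"First": range(4)})
   >>> y = pd.Series(range(4))
   >>> splitter = NoSplit()
   >>> # The single "split" returns all indices as training data.
   >>> train, test = next(iter(splitter.split(X, y)))
   >>> assert len(train) == len(X)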
.. py:class:: StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)

   Wrapper class for sklearn's Stratified KFold splitter.

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.StratifiedKFold.get_metadata_routing
      evalml.preprocessing.data_splitters.StratifiedKFold.get_n_splits
      evalml.preprocessing.data_splitters.StratifiedKFold.is_cv
      evalml.preprocessing.data_splitters.StratifiedKFold.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest


   .. py:method:: get_n_splits(self, X=None, y=None, groups=None)

      Returns the number of splitting iterations in the cross-validator.

      :param X: Always ignored, exists for compatibility.
      :type X: object
      :param y: Always ignored, exists for compatibility.
      :type y: object
      :param groups: Always ignored, exists for compatibility.
      :type groups: object

      :returns: **n_splits** -- Returns the number of splitting iterations in the cross-validator.
      :rtype: int


   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool


   .. py:method:: split(self, X, y, groups=None)

      Generate indices to split data into training and test set.

      :param X: Training data, where `n_samples` is the number of samples and `n_features` is the number of features.

                Note that providing ``y`` is sufficient to generate the splits and hence ``np.zeros(n_samples)`` may be used as a placeholder for ``X`` instead of actual training data.
      :type X: array-like of shape (n_samples, n_features)
      :param y: The target variable for supervised learning problems. Stratification is done based on the y labels.
      :type y: array-like of shape (n_samples,)
      :param groups: Always ignored, exists for compatibility.
      :type groups: object

      :Yields: * **train** (*ndarray*) -- The training set indices for that split.
               * **test** (*ndarray*) -- The testing set indices for that split.

      .. rubric:: Notes

      Randomized CV splitters may return different results for each call of split. You can make the results identical by setting `random_state` to an integer.
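   .. rubric:: Example

   A minimal usage sketch, assuming this wrapper exposes the same ``split`` interface as sklearn's ``StratifiedKFold``. Unlike ``KFold``, ``y`` is required here, and each fold preserves the class balance of ``y``.

   >>> import pandas as pd
   >>> from evalml.preprocessing.data_splitters import StratifiedKFold
   ...
   >>> X = pd.DataFrame({"First": range(8)})
   >>> y = pd.Series([0, 1] * 4)
   >>> skf = StratifiedKFold(n_splits=2)
   >>> # With a balanced binary target, each test fold contains two rows of each class.
   >>> for train, test in skf.split(X, y):
   ...     assert sorted(y.iloc[test].tolist()) == [0, 0, 1, 1]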
.. py:class:: TimeSeriesSplit(max_delay=0, gap=0, forecast_horizon=None, time_index=None, n_series=None, n_splits=3)

   Rolling Origin Cross Validation for time series problems.

   The max_delay, gap, and forecast_horizon parameters are only used to validate that the requested split size is not too small given these parameters.

   :param max_delay: Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid "throwing out" more data than is necessary. Defaults to 0.
   :type max_delay: int
   :param gap: Number of time units separating the data used to generate features and the data to forecast on. Defaults to 0.
   :type gap: int
   :param forecast_horizon: Number of time units to forecast. Used for parameter validation. If an integer, will set the size of the cv splits. Defaults to None.
   :type forecast_horizon: int, None
   :param time_index: Name of the column containing the datetime information used to order the data. Defaults to None.
   :type time_index: str
   :param n_splits: Number of data splits to make. Defaults to 3.
   :type n_splits: int

   .. rubric:: Example

   >>> import numpy as np
   >>> import pandas as pd
   ...
   >>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
   >>> y = pd.Series([i for i in range(10)])
   ...
   >>> ts_split = TimeSeriesSplit(n_splits=4)
   >>> generator_ = ts_split.split(X, y)
   ...
   >>> first_split = next(generator_)
   >>> assert (first_split[0] == np.array([0, 1])).all()
   >>> assert (first_split[1] == np.array([2, 3])).all()
   ...
   ...
   >>> second_split = next(generator_)
   >>> assert (second_split[0] == np.array([0, 1, 2, 3])).all()
   >>> assert (second_split[1] == np.array([4, 5])).all()
   ...
   ...
   >>> third_split = next(generator_)
   >>> assert (third_split[0] == np.array([0, 1, 2, 3, 4, 5])).all()
   >>> assert (third_split[1] == np.array([6, 7])).all()
   ...
   ...
   >>> fourth_split = next(generator_)
   >>> assert (fourth_split[0] == np.array([0, 1, 2, 3, 4, 5, 6, 7])).all()
   >>> assert (fourth_split[1] == np.array([8, 9])).all()

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.TimeSeriesSplit.get_metadata_routing
      evalml.preprocessing.data_splitters.TimeSeriesSplit.get_n_splits
      evalml.preprocessing.data_splitters.TimeSeriesSplit.is_cv
      evalml.preprocessing.data_splitters.TimeSeriesSplit.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest


   .. py:method:: get_n_splits(self, X=None, y=None, groups=None)

      Get the number of data splits.

      :param X: Features to split.
      :type X: pd.DataFrame, None
      :param y: Target variable to split. Defaults to None.
      :type y: pd.DataFrame, None
      :param groups: Ignored but kept for compatibility with sklearn API. Defaults to None.

      :returns: Number of splits.


   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool


   .. py:method:: split(self, X, y=None, groups=None)

      Get the time series splits.

      X and y are assumed to be sorted in ascending time order. This method can handle passing in empty or None X and y data, but note that X and y cannot be None or empty at the same time.

      :param X: Features to split.
      :type X: pd.DataFrame, None
      :param y: Target variable to split. Defaults to None.
      :type y: pd.DataFrame, None
      :param groups: Ignored but kept for compatibility with sklearn API. Defaults to None.

      :Yields: Iterator of (train, test) indices tuples.

      :raises ValueError: If one of the proposed splits would be empty.


.. py:class:: TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)

   Split the training data into training and validation sets.

   :param test_size: What percentage of data points should be included in the validation set. Defaults to the complement of `train_size` if `train_size` is set, and 0.25 otherwise.
   :type test_size: float
   :param train_size: What percentage of data points should be included in the training set. Defaults to the complement of `test_size`.
   :type train_size: float
   :param shuffle: Whether to shuffle the data before splitting. Defaults to False.
   :type shuffle: boolean
   :param stratify: Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.
   :type stratify: list
   :param random_seed: The seed to use for random sampling. Defaults to 0.
   :type random_seed: int

   .. rubric:: Examples

   >>> import numpy as np
   >>> import pandas as pd
   ...
   >>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
   >>> y = pd.Series([i for i in range(10)])
   ...
   >>> tv_split = TrainingValidationSplit()
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([0, 1, 2, 3, 4, 5, 6])).all()
   >>> assert (split_[1] == np.array([7, 8, 9])).all()
   ...
   ...
   >>> tv_split = TrainingValidationSplit(test_size=0.5)
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([0, 1, 2, 3, 4])).all()
   >>> assert (split_[1] == np.array([5, 6, 7, 8, 9])).all()
   ...
   ...
   >>> tv_split = TrainingValidationSplit(shuffle=True)
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([9, 1, 6, 7, 3, 0, 5])).all()
   >>> assert (split_[1] == np.array([2, 8, 4])).all()
   ...
   ...
   >>> y = pd.Series([i % 3 for i in range(10)])
   >>> tv_split = TrainingValidationSplit(shuffle=True, stratify=y)
   >>> split_ = next(tv_split.split(X, y))
   >>> assert (split_[0] == np.array([1, 9, 3, 2, 8, 6, 7])).all()
   >>> assert (split_[1] == np.array([0, 4, 5])).all()

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.preprocessing.data_splitters.TrainingValidationSplit.get_metadata_routing
      evalml.preprocessing.data_splitters.TrainingValidationSplit.get_n_splits
      evalml.preprocessing.data_splitters.TrainingValidationSplit.is_cv
      evalml.preprocessing.data_splitters.TrainingValidationSplit.split

   .. py:method:: get_metadata_routing(self)

      Get metadata routing of this object.

      Please check :ref:`User Guide <metadata_routing>` on how the routing mechanism works.

      :returns: **routing** -- A :class:`~utils.metadata_routing.MetadataRequest` encapsulating routing information.
      :rtype: MetadataRequest


   .. py:method:: get_n_splits()
      :staticmethod:

      Return the number of splits of this object.

      :returns: Always returns 1.
      :rtype: int


   .. py:method:: is_cv(self)
      :property:

      Returns whether or not the data splitter is a cross-validation data splitter.

      :returns: If the splitter is a cross-validation data splitter.
      :rtype: bool


   .. py:method:: split(self, X, y=None)

      Divide the data into training and testing sets.

      :param X: Dataframe of points to split.
      :type X: pd.DataFrame
      :param y: Series of points to split.
      :type y: pd.Series

      :returns: Indices to split data into training and test set.
      :rtype: list
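In practice, these splitters are usually consumed by evalml's AutoML search rather than called directly. A minimal sketch, assuming the standard ``AutoMLSearch`` entry point and its ``data_splitter`` argument (verify both against your installed evalml version), with hypothetical training data ``X`` and ``y``:

>>> from evalml.automl import AutoMLSearch  # doctest: +SKIP
>>> from evalml.preprocessing.data_splitters import TrainingValidationSplit  # doctest: +SKIP
>>> # Use a single shuffled train/validation split instead of the default CV splitter.
>>> automl = AutoMLSearch(  # doctest: +SKIP
...     X_train=X,
...     y_train=y,
...     problem_type="binary",
...     data_splitter=TrainingValidationSplit(test_size=0.2, shuffle=True),
... )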