Preprocessing#

Preprocessing utilities.

Package Contents#

Classes Summary#

NoSplit

Does not split the training data into training and validation sets.

TimeSeriesSplit

Rolling Origin Cross Validation for time series problems.

TrainingValidationSplit

Split the training data into training and validation sets.

Functions#

load_data

Load features and target from file.

number_of_features

Get the number of features of each specific dtype in a DataFrame.

split_data

Split data into train and test sets.

split_multiseries_data

Split stacked multiseries data into train and test sets. Unstacked data can use split_data.

target_distribution

Get the target distributions.

Contents#

evalml.preprocessing.load_data(path, index, target, n_rows=None, drop=None, verbose=True, **kwargs)[source]#

Load features and target from file.

Parameters
  • path (str) – Path to file or an http/ftp/s3 URL.

  • index (str) – Column for index.

  • target (str) – Column for target.

  • n_rows (int) – Number of rows to return. Defaults to None.

  • drop (list) – List of columns to drop. Defaults to None.

  • verbose (bool) – If True, prints information about features and target. Defaults to True.

  • **kwargs – Other keyword arguments that should be passed to pandas' read_csv method.

Returns

Features matrix and target.

Return type

pd.DataFrame, pd.Series
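
Example

A minimal sketch of loading features and a target from a CSV, assuming a hypothetical local file fraud.csv with an index column "id" and a target column "fraud":

>>> from evalml.preprocessing import load_data
...
>>> X, y = load_data("fraud.csv", index="id", target="fraud", verbose=False)  # hypothetical file and column names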

class evalml.preprocessing.NoSplit(random_seed=0)[source]#

Does not split the training data into training and validation sets.

All data is passed as the training set; the test set is simply an array of None. Intended for future unsupervised learning and should not be used in any of the currently supported pipelines.

Parameters

random_seed (int) – The seed to use for random sampling. Defaults to 0. Not used.
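
Example

A minimal usage sketch based on the behavior described above: every row lands in the training indices and nothing is held out.

>>> import pandas as pd
>>> from evalml.preprocessing import NoSplit
...
>>> X = pd.DataFrame({"feature": range(5)})
>>> y = pd.Series(range(5))
>>> splitter = NoSplit()
>>> train_indices, test_indices = next(iter(splitter.split(X, y)))
>>> assert len(train_indices) == len(X)  # all data is passed as the training set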

Methods

get_metadata_routing

Get metadata routing of this object.

get_n_splits

Return the number of splits of this object.

is_cv

Returns whether or not the data splitter is a cross-validation data splitter.

split

Divide the data into training and testing sets, where the testing set is empty.

get_metadata_routing(self)#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

static get_n_splits()[source]#

Return the number of splits of this object.

Returns

Always returns 0.

Return type

int

property is_cv(self)#

Returns whether or not the data splitter is a cross-validation data splitter.

Returns

Whether the splitter is a cross-validation data splitter.

Return type

bool

split(self, X, y=None)[source]#

Divide the data into training and testing sets, where the testing set is empty.

Parameters
  • X (pd.DataFrame) – Dataframe of points to split.

  • y (pd.Series) – Series of points to split.

Returns

Indices to split data into training and test sets.

Return type

list

evalml.preprocessing.number_of_features(dtypes)[source]#

Get the number of features of each specific dtype in a DataFrame.

Parameters

dtypes (pd.Series) – DataFrame.dtypes to get the number of features for.

Returns

dtypes and the number of features for each input type.

Return type

pd.Series

Example

>>> import pandas as pd
>>> from evalml.preprocessing import number_of_features
...
>>> X = pd.DataFrame()
>>> X["integers"] = [i for i in range(10)]
>>> X["floats"] = [float(i) for i in range(10)]
>>> X["strings"] = [str(i) for i in range(10)]
>>> X["booleans"] = [bool(i%2) for i in range(10)]

Lists the number of columns corresponding to each dtype.

>>> number_of_features(X.dtypes)
             Number of Features
Boolean                       1
Categorical                   1
Numeric                       2
evalml.preprocessing.split_data(X, y, problem_type, problem_configuration=None, test_size=None, random_seed=0)[source]#

Split data into train and test sets.

Parameters
  • X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].

  • y (pd.Series or np.ndarray) – Target data of length [n_samples].

  • problem_type (str or ProblemTypes) – Type of supervised learning problem. See evalml.problem_types.problemtype.all_problem_types for a full list.

  • problem_configuration (dict) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables.

  • test_size (float) – What percentage of data points should be included in the test set. Defaults to 0.2 (20%) for non-timeseries problems and 0.1 (10%) for timeseries problems.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

Feature and target data each split into train and test sets.

Return type

pd.DataFrame, pd.DataFrame, pd.Series, pd.Series

Raises

ValueError – If the problem_configuration is missing or does not contain both a time_index and series_id for multiseries problems.

Examples

>>> import pandas as pd
>>> from evalml.preprocessing import split_data
...
>>> X = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["First"])
>>> y = pd.Series([8, 9, 10, 11, 12, 13])
...
>>> X_train, X_validation, y_train, y_validation = split_data(X, y, "regression", random_seed=42)
>>> X_train
   First
5      6
2      3
4      5
3      4
>>> X_validation
   First
0      1
1      2
>>> y_train
5    13
2    10
4    12
3    11
dtype: int64
>>> y_validation
0    8
1    9
dtype: int64
evalml.preprocessing.split_multiseries_data(X, y, series_id, time_index, **kwargs)[source]#

Split stacked multiseries data into train and test sets. Unstacked data can use split_data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples*n_series, n_features].

  • y (pd.Series) – The training targets of length [n_samples*n_series].

  • series_id (str) – Name of column containing series id.

  • time_index (str) – Name of column containing time index.

  • **kwargs – Additional keyword arguments to pass to the split_data function.

Returns

Feature and target data each split into train and test sets.

Return type

pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
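
Example

A minimal sketch with two stacked series, assuming hypothetical column names "series" and "date":

>>> import pandas as pd
>>> from evalml.preprocessing import split_multiseries_data
...
>>> X = pd.DataFrame({
...     "series": ["a"] * 5 + ["b"] * 5,
...     "date": list(pd.date_range("2021-01-01", periods=5)) * 2,
...     "feature": range(10),
... })
>>> y = pd.Series(range(10))
>>> X_train, X_test, y_train, y_test = split_multiseries_data(X, y, series_id="series", time_index="date")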

evalml.preprocessing.target_distribution(targets)[source]#

Get the target distributions.

Parameters

targets (pd.Series) – Target data.

Returns

Target data and their frequency distribution as percentages.

Return type

pd.Series

Examples

>>> import pandas as pd
>>> from evalml.preprocessing import target_distribution
...
>>> y = pd.Series([1, 2, 4, 1, 3, 3, 1, 2])
>>> print(target_distribution(y).to_string())
Targets
1    37.50%
2    25.00%
3    25.00%
4    12.50%
>>> y = pd.Series([True, False, False, False, True])
>>> print(target_distribution(y).to_string())
Targets
False    60.00%
True     40.00%
class evalml.preprocessing.TimeSeriesSplit(max_delay=0, gap=0, forecast_horizon=None, time_index=None, n_series=None, n_splits=3)[source]#

Rolling Origin Cross Validation for time series problems.

The max_delay, gap, and forecast_horizon parameters are only used to validate that the requested split size is not too small given these parameters.

Parameters
  • max_delay (int) – Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid “throwing out” more data than is necessary. Defaults to 0.

  • gap (int) – Number of time units separating the data used to generate features and the data to forecast on. Defaults to 0.

  • forecast_horizon (int, None) – Number of time units to forecast. Used for parameter validation. If an integer, will set the size of the cv splits. Defaults to None.

  • time_index (str) – Name of the column containing the datetime information used to order the data. Defaults to None.

  • n_splits (int) – Number of data splits to make. Defaults to 3.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing import TimeSeriesSplit
...
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> ts_split = TimeSeriesSplit(n_splits=4)
>>> generator_ = ts_split.split(X, y)
...
>>> first_split = next(generator_)
>>> assert (first_split[0] == np.array([0, 1])).all()
>>> assert (first_split[1] == np.array([2, 3])).all()
...
...
>>> second_split = next(generator_)
>>> assert (second_split[0] == np.array([0, 1, 2, 3])).all()
>>> assert (second_split[1] == np.array([4, 5])).all()
...
...
>>> third_split = next(generator_)
>>> assert (third_split[0] == np.array([0, 1, 2, 3, 4, 5])).all()
>>> assert (third_split[1] == np.array([6, 7])).all()
...
...
>>> fourth_split = next(generator_)
>>> assert (fourth_split[0] == np.array([0, 1, 2, 3, 4, 5, 6, 7])).all()
>>> assert (fourth_split[1] == np.array([8, 9])).all()
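
When forecast_horizon is an integer, it sets the size of each validation fold, as noted above. A sketch reusing the data from this example (the fold sizes are an assumption based on that documented behavior):

>>> ts_split = TimeSeriesSplit(n_splits=2, forecast_horizon=3)
>>> for train_indices, test_indices in ts_split.split(X, y):
...     assert len(test_indices) == 3  # each validation fold spans one forecast horizon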

Methods

get_metadata_routing

Get metadata routing of this object.

get_n_splits

Get the number of data splits.

is_cv

Returns whether or not the data splitter is a cross-validation data splitter.

split

Get the time series splits.

get_metadata_routing(self)#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

get_n_splits(self, X=None, y=None, groups=None)[source]#

Get the number of data splits.

Parameters
  • X (pd.DataFrame, None) – Features to split.

  • y (pd.DataFrame, None) – Target variable to split. Defaults to None.

  • groups – Ignored but kept for compatibility with sklearn API. Defaults to None.

Returns

Number of splits.

property is_cv(self)#

Returns whether or not the data splitter is a cross-validation data splitter.

Returns

Whether the splitter is a cross-validation data splitter.

Return type

bool

split(self, X, y=None, groups=None)[source]#

Get the time series splits.

X and y are assumed to be sorted in ascending time order. This method can handle passing in empty or None X and y data, but note that X and y cannot both be None or empty at the same time.

Parameters
  • X (pd.DataFrame, None) – Features to split.

  • y (pd.DataFrame, None) – Target variable to split. Defaults to None.

  • groups – Ignored but kept for compatibility with sklearn API. Defaults to None.

Yields

Iterator of (train, test) indices tuples.

Raises

ValueError – If one of the proposed splits would be empty.

class evalml.preprocessing.TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)[source]#

Split the training data into training and validation sets.

Parameters
  • test_size (float) – What percentage of data points should be included in the validation set. Defaults to the complement of train_size if train_size is set, and 0.25 otherwise.

  • train_size (float) – What percentage of data points should be included in the training set. Defaults to the complement of test_size.

  • shuffle (boolean) – Whether to shuffle the data before splitting. Defaults to False.

  • stratify (list) – Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.

  • random_seed (int) – The seed to use for random sampling. Defaults to 0.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from evalml.preprocessing import TrainingValidationSplit
...
>>> X = pd.DataFrame([i for i in range(10)], columns=["First"])
>>> y = pd.Series([i for i in range(10)])
...
>>> tv_split = TrainingValidationSplit()
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4, 5, 6])).all()
>>> assert (split_[1] == np.array([7, 8, 9])).all()
...
...
>>> tv_split = TrainingValidationSplit(test_size=0.5)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([0, 1, 2, 3, 4])).all()
>>> assert (split_[1] == np.array([5, 6, 7, 8, 9])).all()
...
...
>>> tv_split = TrainingValidationSplit(shuffle=True)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([9, 1, 6, 7, 3, 0, 5])).all()
>>> assert (split_[1] == np.array([2, 8, 4])).all()
...
...
>>> y = pd.Series([i % 3 for i in range(10)])
>>> tv_split = TrainingValidationSplit(shuffle=True, stratify=y)
>>> split_ = next(tv_split.split(X, y))
>>> assert (split_[0] == np.array([1, 9, 3, 2, 8, 6, 7])).all()
>>> assert (split_[1] == np.array([0, 4, 5])).all()

Methods

get_metadata_routing

Get metadata routing of this object.

get_n_splits

Return the number of splits of this object.

is_cv

Returns whether or not the data splitter is a cross-validation data splitter.

split

Divide the data into training and testing sets.

get_metadata_routing(self)#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

static get_n_splits()[source]#

Return the number of splits of this object.

Returns

Always returns 1.

Return type

int

property is_cv(self)#

Returns whether or not the data splitter is a cross-validation data splitter.

Returns

Whether the splitter is a cross-validation data splitter.

Return type

bool

split(self, X, y=None)[source]#

Divide the data into training and testing sets.

Parameters
  • X (pd.DataFrame) – Dataframe of points to split.

  • y (pd.Series) – Series of points to split.

Returns

Indices to split data into training and test sets.

Return type

list