Preprocessing
Subpackages
Package Contents
Classes Summary
- TimeSeriesSplit: Rolling Origin Cross Validation for time series problems.
- TrainingValidationSplit: Split the training data into training and validation sets.
Functions
- drop_nan_target_rows: Drops rows in X and y when a row in the target y has a value of NaN.
- load_data: Load features and target from file.
- number_of_features: Get the number of features of each specific dtype in a DataFrame.
- split_data: Splits data into train and test sets.
- target_distribution: Get the target distributions.
Contents
evalml.preprocessing.drop_nan_target_rows(X, y)
Drops rows in X and y when the corresponding row in the target y has a value of NaN.
- Parameters
X (pd.DataFrame, np.ndarray) – Data to transform
y (pd.Series, np.ndarray) – Target data
- Returns
Transformed X (and y, if passed in) with rows that had a NaN value removed.
- Return type
pd.DataFrame, pd.DataFrame
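The behavior described above can be sketched in plain pandas (the helper name `drop_nan_target_rows_sketch` is illustrative, not evalml's implementation):

```python
import numpy as np
import pandas as pd

def drop_nan_target_rows_sketch(X, y):
    """Drop rows from X and y wherever the target y is NaN (pandas-only sketch)."""
    mask = ~y.isna()          # True for rows whose target is present
    return X[mask], y[mask]   # boolean mask selects the surviving rows

X = pd.DataFrame({"a": [1, 2, 3, 4]})
y = pd.Series([1.0, np.nan, 3.0, np.nan])
X_t, y_t = drop_nan_target_rows_sketch(X, y)
# Rows 1 and 3 (NaN targets) are removed; rows 0 and 2 remain.
```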
evalml.preprocessing.load_data(path, index, target, n_rows=None, drop=None, verbose=True, **kwargs)
Load features and target from file.
- Parameters
path (str) – Path to file or a http/ftp/s3 URL
index (str) – Column for index
target (str) – Column for target
n_rows (int) – Number of rows to return
drop (list) – List of columns to drop
verbose (bool) – If True, prints information about features and target
- Returns
Features matrix and target
- Return type
pd.DataFrame, pd.Series
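A rough pandas-only sketch of this loading behavior (the helper name, the CSV contents, and the use of an in-memory buffer in place of a file path are all illustrative; evalml's actual implementation may differ):

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file path or URL.
csv = io.StringIO("id,feat,target\n1,0.5,10\n2,0.7,20\n")

def load_data_sketch(path, index, target, n_rows=None, drop=None, **kwargs):
    """Read a CSV, set the index, and split off the target column (sketch)."""
    df = pd.read_csv(path, nrows=n_rows, **kwargs)
    df = df.set_index(index)
    if drop:
        df = df.drop(columns=drop)
    y = df.pop(target)  # remove the target column and keep it as a Series
    return df, y

X, y = load_data_sketch(csv, index="id", target="target")
```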
evalml.preprocessing.number_of_features(dtypes)
Get the number of features of each specific dtype in a DataFrame.
- Parameters
dtypes (pd.Series) – DataFrame.dtypes to get the number of features for
- Returns
dtypes and the number of features for each input type
- Return type
pd.Series
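Counting features per dtype amounts to a value count over `DataFrame.dtypes`; a minimal pandas sketch of the idea (not evalml's exact grouping, which may bucket dtypes into broader categories):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2],        # int64
    "b": [0.1, 0.2],    # float64
    "c": ["x", "y"],    # object
})

# One feature of each dtype in this toy frame.
counts = df.dtypes.value_counts()
```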
evalml.preprocessing.split_data(X, y, problem_type, problem_configuration=None, test_size=0.2, random_seed=0)
Splits data into train and test sets.
- Parameters
X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features]
y (pd.Series or np.ndarray) – Target data of length [n_samples]
problem_type (str or ProblemTypes) – Type of supervised learning problem. See evalml.problem_types.problemtype.all_problem_types for a full list.
problem_configuration (dict) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the date_index, gap, and max_delay variables.
test_size (float) – What percentage of data points should be included in the test set. Defaults to 0.2 (20%).
random_seed (int) – Seed for the random number generator. Defaults to 0.
- Returns
Feature and target data each split into train and test sets
- Return type
pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
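For the default (non-time-series) case, the split is essentially a seeded random partition; a numpy/pandas sketch that ignores `problem_type` and `problem_configuration` (helper name and partition logic are assumptions, not evalml's code):

```python
import numpy as np
import pandas as pd

def split_data_sketch(X, y, test_size=0.2, random_seed=0):
    """Seeded random train/test split of aligned X and y (sketch)."""
    rng = np.random.default_rng(random_seed)
    n_test = int(len(X) * test_size)
    perm = rng.permutation(len(X))            # shuffled row positions
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

X = pd.DataFrame({"a": range(10)})
y = pd.Series(range(10))
X_train, X_test, y_train, y_test = split_data_sketch(X, y, test_size=0.2)
# With test_size=0.2 and 10 rows: 8 training rows, 2 test rows.
```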
evalml.preprocessing.target_distribution(targets)
Get the target distributions.
- Parameters
targets (pd.Series) – Target data
- Returns
Target data and their frequency distribution as percentages.
- Return type
pd.Series
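The frequency-as-percentage output described above can be reproduced with a normalized value count; a one-line pandas sketch of the idea:

```python
import pandas as pd

targets = pd.Series(["a", "a", "b", "b", "b", "c"] * 2)  # 12 labels

# Each class's share of the target, expressed as a percentage.
dist = targets.value_counts(normalize=True) * 100
# "b" accounts for 6 of 12 labels, i.e. 50%.
```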
class evalml.preprocessing.TimeSeriesSplit(max_delay=0, gap=0, date_index=None, n_splits=3)
Rolling Origin Cross Validation for time series problems.
This class uses max_delay and gap values to account for the fact that evalml time series pipelines perform some feature and target engineering, e.g. delaying input features and shifting the target variable by the desired amount. If the data that will be split already has all the features and appropriate target values, set max_delay and gap to 0.
- Parameters
max_delay (int) – Max delay value for feature engineering. Time series pipelines create delayed features from existing features. This process will introduce NaNs into the first max_delay number of rows. The splitter uses the last max_delay number of rows from the previous split as the first max_delay number of rows of the current split to avoid “throwing out” more data than is necessary. Defaults to 0.
gap (int) – Gap used in time series problem. Time series pipelines shift the target variable by gap rows. Defaults to 0.
date_index (str) – Name of the column containing the datetime information used to order the data. Defaults to None.
n_splits (int) – Number of data splits to make. Defaults to 3.
Methods
- get_n_splits: Get the number of data splits.
- split: Get the time series splits.
split(self, X, y=None, groups=None)
Get the time series splits.
X and y are assumed to be sorted in ascending time order. This method can handle X or y being empty or None, but X and y cannot both be None or empty at the same time.
- Parameters
X (pd.DataFrame, None) – Features to split.
y (pd.DataFrame, None) – Target variable to split.
groups – Ignored but kept for compatibility with the sklearn API.
- Returns
Iterator of (train, test) indices tuples.
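The rolling-origin idea can be sketched with plain numpy index arithmetic. Everything here is an assumption for illustration: the helper name, the equal-sized folds, and the exact boundary handling of max_delay are not evalml's implementation, only a picture of a growing training window followed by the next block of test rows:

```python
import numpy as np

def rolling_origin_splits(n_samples, n_splits=3, max_delay=0):
    """Yield (train, test) index arrays for rolling-origin CV (sketch)."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        # Training window grows by one fold each split.
        train = np.arange(0, i * fold)
        # Test window backs up max_delay rows so delayed features can be
        # recomputed without discarding data at the split boundary.
        start = max(i * fold - max_delay, 0)
        test = np.arange(start, min((i + 1) * fold, n_samples))
        yield train, test

splits = list(rolling_origin_splits(12, n_splits=3, max_delay=1))
# Split 1: train rows 0-2, test rows 2-5; later splits keep growing the window.
```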
class evalml.preprocessing.TrainingValidationSplit(test_size=None, train_size=None, shuffle=False, stratify=None, random_seed=0)
Split the training data into training and validation sets.
- Parameters
test_size (float) – What percentage of data points should be included in the validation set. Defaults to the complement of train_size if train_size is set, and 0.25 otherwise.
train_size (float) – What percentage of data points should be included in the training set. Defaults to the complement of test_size.
shuffle (boolean) – Whether to shuffle the data before splitting. Defaults to False.
stratify (list) – Splits the data in a stratified fashion, using this argument as class labels. Defaults to None.
random_seed (int) – The seed to use for random sampling. Defaults to 0.
Methods
- get_n_splits: Returns the number of splits of this object.
- split: Divides the data into training and testing sets.
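A single train/validation partition as described above can be sketched over row indices with numpy (the helper name and the tail-of-the-data validation set under shuffle=False are illustrative assumptions, not evalml's code):

```python
import numpy as np

def training_validation_split(n_samples, test_size=0.25, shuffle=False, random_seed=0):
    """One train/validation split of row indices (sketch of the behavior)."""
    indices = np.arange(n_samples)
    if shuffle:
        np.random.default_rng(random_seed).shuffle(indices)
    n_valid = int(n_samples * test_size)
    # Without shuffling, the last n_valid rows become the validation set.
    return indices[: n_samples - n_valid], indices[n_samples - n_valid:]

train, valid = training_validation_split(8)  # default test_size=0.25
# 8 rows -> 6 training indices and 2 validation indices.
```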