utils

Helpful preprocessing utilities.

Module Contents

Functions

  • load_data – Load features and target from file.

  • number_of_features – Get the number of features of each specific dtype in a DataFrame.

  • split_data – Split data into train and test sets.

  • split_multiseries_data – Split stacked multiseries data into train and test sets. For unstacked data, use split_data.

  • target_distribution – Get the target distributions.

Contents

evalml.preprocessing.utils.load_data(path, index, target, n_rows=None, drop=None, verbose=True, **kwargs)

Load features and target from file.

Parameters
  • path (str) – Path to a file, or an HTTP, FTP, or S3 URL.

  • index (str) – Column for index.

  • target (str) – Column for target.

  • n_rows (int) – Number of rows to return. Defaults to None.

  • drop (list) – List of columns to drop. Defaults to None.

  • verbose (bool) – If True, prints information about features and target. Defaults to True.

  • **kwargs – Additional keyword arguments passed to pandas’ read_csv function.

Returns

Features matrix and target.

Return type

pd.DataFrame, pd.Series
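
The parameters above can be illustrated with a rough pandas-only sketch. This is not evalml's implementation: load_data_sketch is a hypothetical name, the in-memory CSV stands in for a real path or URL, and the verbose printing is left out. It only assumes that index maps to the row index, n_rows caps the rows read, and the target and drop columns are removed from the features.

```python
import io

import pandas as pd


def load_data_sketch(path, index, target, n_rows=None, drop=None, **kwargs):
    """Hypothetical pandas-only approximation of load_data (not evalml's code)."""
    # Read at most n_rows rows and use the `index` column as the row index.
    df = pd.read_csv(path, index_col=index, nrows=n_rows, **kwargs)
    # Separate the target, then remove it and any `drop` columns from the features.
    y = df[target]
    X = df.drop(columns=[target] + list(drop or []))
    return X, y


# In-memory CSV standing in for a file path or URL.
csv_text = "id,age,notes,label\n1,34,a,0\n2,51,b,1\n3,27,c,0\n"
X, y = load_data_sketch(
    io.StringIO(csv_text), index="id", target="label", n_rows=2, drop=["notes"]
)
print(X)
print(y)
```

With n_rows=2 only the first two records are read, and the features keep just the age column because label is the target and notes is dropped.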

evalml.preprocessing.utils.number_of_features(dtypes)

Get the number of features of each specific dtype in a DataFrame.

Parameters

dtypes (pd.Series) – DataFrame.dtypes to get the number of features for.

Returns

dtypes and the number of features for each input type.

Return type

pd.Series

Example

>>> X = pd.DataFrame()
>>> X["integers"] = [i for i in range(10)]
>>> X["floats"] = [float(i) for i in range(10)]
>>> X["strings"] = [str(i) for i in range(10)]
>>> X["booleans"] = [bool(i%2) for i in range(10)]

Lists the number of columns corresponding to each dtype.

>>> number_of_features(X.dtypes)
             Number of Features
Boolean                       1
Categorical                   1
Numeric                       2

evalml.preprocessing.utils.split_data(X, y, problem_type, problem_configuration=None, test_size=None, random_seed=0)

Split data into train and test sets.

Parameters
  • X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].

  • y (pd.Series or np.ndarray) – Target data of length [n_samples].

  • problem_type (str or ProblemTypes) – Type of supervised learning problem. See evalml.problem_types.problemtype.all_problem_types for a full list.

  • problem_configuration (dict) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables.

  • test_size (float) – Fraction of data points to include in the test set. Defaults to 0.2 (20%) for non-timeseries problems and 0.1 (10%) for timeseries problems.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

Feature and target data each split into train and test sets.

Return type

pd.DataFrame, pd.DataFrame, pd.Series, pd.Series

Examples

>>> X = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["First"])
>>> y = pd.Series([8, 9, 10, 11, 12, 13])
...
>>> X_train, X_validation, y_train, y_validation = split_data(X, y, "regression", random_seed=42)
>>> X_train
   First
5      6
2      3
4      5
3      4
>>> X_validation
   First
0      1
1      2
>>> y_train
5    13
2    10
4    12
3    11
dtype: int64
>>> y_validation
0    8
1    9
dtype: int64
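
The test_size defaults described in the parameters can be sketched as a small standalone helper. resolve_test_size is a hypothetical name and is not part of evalml. Two assumptions are made here: time series problem types are recognised by the words "time series" in their name, and the number of test rows is rounded up, which is consistent with the example above where 6 rows yield 2 validation rows.

```python
import math


def resolve_test_size(problem_type, n_samples, test_size=None):
    """Hypothetical helper: apply the documented default and count test rows."""
    if test_size is None:
        # Assumption: time series problem types contain the words "time series".
        is_time_series = "time series" in str(problem_type).lower()
        test_size = 0.1 if is_time_series else 0.2
    # Assumption: the test-set size is rounded up to a whole number of rows.
    n_test = math.ceil(test_size * n_samples)
    return test_size, n_test


print(resolve_test_size("regression", 6))
print(resolve_test_size("time series regression", 25))
print(resolve_test_size("binary", 10, test_size=0.5))
```

An explicit test_size always takes precedence over the default, whatever the problem type.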

evalml.preprocessing.utils.split_multiseries_data(X, y, series_id, time_index, **kwargs)

Split stacked multiseries data into train and test sets. For unstacked data, use split_data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples*n_series, n_features].

  • y (pd.Series) – The training targets of length [n_samples*n_series].

  • series_id (str) – Name of column containing series id.

  • time_index (str) – Name of column containing time index.

  • **kwargs – Additional keyword arguments to pass to the split_data function.

Returns

Feature and target data each split into train and test sets.

Return type

pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
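
To make the stacked layout concrete, the sketch below builds two series stacked row-wise and holds out the latest timestamps for every series at once. It is a pandas-only illustration under stated assumptions, not evalml's implementation: split_stacked_by_time, the column names, and the 10% holdout are all hypothetical, and the series_id column is used only to check the result.

```python
import pandas as pd


def split_stacked_by_time(X, y, time_index, test_size=0.1):
    """Hypothetical helper: hold out the latest timestamps for every series."""
    # Unique timestamps in chronological order, shared by all series.
    times = X[time_index].drop_duplicates().sort_values()
    n_test = max(1, round(test_size * len(times)))
    cutoff = times.iloc[-n_test]
    # Rows at or after the cutoff form the test set, for all series at once.
    is_test = X[time_index] >= cutoff
    return X[~is_test], X[is_test], y[~is_test], y[is_test]


# Two series ("a" and "b") stacked row-wise, ten dates each.
dates = pd.date_range("2023-01-01", periods=10)
X = pd.DataFrame(
    {
        "date": list(dates) * 2,
        "series_id": ["a"] * 10 + ["b"] * 10,
    }
)
y = pd.Series(range(20), name="target")

X_train, X_test, y_train, y_test = split_stacked_by_time(X, y, time_index="date")
print(X_test)
```

Splitting on a shared timestamp cutoff keeps every series present in both sets and ensures that no training row is later than any test row.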

evalml.preprocessing.utils.target_distribution(targets)

Get the target distributions.

Parameters

targets (pd.Series) – Target data.

Returns

Target data and their frequency distribution as percentages.

Return type

pd.Series

Examples

>>> y = pd.Series([1, 2, 4, 1, 3, 3, 1, 2])
>>> print(target_distribution(y).to_string())
Targets
1    37.50%
2    25.00%
3    25.00%
4    12.50%
>>> y = pd.Series([True, False, False, False, True])
>>> print(target_distribution(y).to_string())
Targets
False    60.00%
True     40.00%