utils

Module Contents

`drop_nan_target_rows`	Drops rows in X and y where the corresponding value in the target y is NaN.
`load_data`	Load features and target from file.
`number_of_features`	Get the number of features of each specific dtype in a DataFrame.
`split_data`	Splits data into train and test sets.
`target_distribution`	Get the target distributions.

evalml.preprocessing.utils.drop_nan_target_rows(X, y)

Drops rows in X and y where the corresponding value in the target y is NaN.

Parameters

Returns

Transformed X (and y, if passed in) with rows that had a NaN value removed.

Return type

pd.DataFrame, pd.DataFrame
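
The behavior described above can be approximated in plain pandas. This is a minimal sketch, not evalml's implementation: the helper name `drop_nan_target_rows_sketch` is hypothetical, and it assumes X and y have the same number of rows in the same order.

```python
import numpy as np
import pandas as pd


def drop_nan_target_rows_sketch(X, y):
    """Illustrative only: keep the rows of X and y where y is not NaN."""
    # Positional boolean mask, True where the target value is present.
    keep = y.notna().to_numpy()
    return X.loc[keep], y.loc[keep]


X = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})
y = pd.Series([0.0, np.nan, 1.0, np.nan])

X_clean, y_clean = drop_nan_target_rows_sketch(X, y)
print(X_clean)
print(y_clean)
```

Rows 1 and 3 are dropped from both objects because their target values are NaN, so X and y stay aligned.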

evalml.preprocessing.utils.load_data(path, index, target, n_rows=None, drop=None, verbose=True, **kwargs)

Load features and target from file.

Parameters

Returns

Features matrix and target

Return type

pd.DataFrame, pd.Series
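
As a rough illustration of returning a features matrix and a target from one file, the following sketch uses `pandas.read_csv` directly. It is not evalml's code. The helper name is hypothetical, and the meanings given to `index` (index column), `target` (target column), `n_rows` (row limit), and `drop` (columns to discard) are assumptions inferred from the parameter names in the signature.

```python
import io

import pandas as pd

# In-memory CSV so the example runs without a file on disk.
csv_text = """id,age,income,notes,churned
1,31,52000,a,0
2,45,61000,b,1
3,27,48000,c,0
"""


def load_features_and_target_sketch(source, index, target, n_rows=None, drop=None):
    """Illustrative only: read a CSV and separate the target column from the features."""
    df = pd.read_csv(source, index_col=index, nrows=n_rows)
    y = df[target]
    # Remove the target and any unwanted columns from the features.
    X = df.drop(columns=[target] + list(drop or []))
    return X, y


X, y = load_features_and_target_sketch(
    io.StringIO(csv_text), index="id", target="churned", n_rows=2, drop=["notes"]
)
print(X)
print(y)
```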

evalml.preprocessing.utils.number_of_features(dtypes)

Get the number of features of each specific dtype in a DataFrame.

Parameters: dtypes (pd.Series) – DataFrame.dtypes to get the number of features for
Returns: dtypes and the number of features for each input type
Return type: pd.Series
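
Counting columns per dtype can be sketched with pandas alone. This is an approximation of the idea, not the library's implementation; evalml may group or label dtypes differently from the raw dtype names used here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [31, 45, 27],
        "income": [52000.0, 61000.5, 48000.0],
        "height": [1.70, 1.82, 1.65],
        "is_member": [True, False, True],
    }
)

# DataFrame.dtypes is a pd.Series mapping column name -> dtype;
# value_counts() then counts how many columns share each dtype.
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)
```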

evalml.preprocessing.utils.split_data(X, y, problem_type, problem_configuration=None, test_size=0.2, random_seed=0)

Splits data into train and test sets.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series or np.ndarray) – Target data of length [n_samples].
problem_type (str or ProblemTypes) – Type of supervised learning problem. See evalml.problem_types.problemtype.all_problem_types for a full list.
problem_configuration (dict) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the date_index, gap, and max_delay variables.
test_size (float) – Fraction of data points to include in the test set. Defaults to 0.2 (20%).
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

Feature and target data each split into train and test sets

Return type

pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
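
To show how `test_size` and `random_seed` interact, here is a minimal shuffled split written with NumPy and pandas. It is a sketch under simplifying assumptions, not evalml's implementation: the helper name is hypothetical, and it ignores `problem_type` and `problem_configuration`, so it does none of the problem-specific handling (for example, for time series) that the real function may perform.

```python
import numpy as np
import pandas as pd


def shuffled_split_sketch(X, y, test_size=0.2, random_seed=0):
    """Illustrative only: seeded random split into train and test sets."""
    n_samples = len(X)
    n_test = int(round(n_samples * test_size))
    # A seeded generator makes the shuffle reproducible.
    rng = np.random.default_rng(random_seed)
    order = rng.permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


X = pd.DataFrame({"f1": range(10), "f2": range(10, 20)})
y = pd.Series([0, 1] * 5)

X_train, X_test, y_train, y_test = shuffled_split_sketch(
    X, y, test_size=0.2, random_seed=0
)
print(len(X_train), len(X_test))
```

The return order mirrors the documented return type: training features, test features, training target, test target. Calling the helper again with the same seed yields the same partition.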

evalml.preprocessing.utils.target_distribution(targets)

Get the target distributions.
