utils
Helpful preprocessing utilities.
Module Contents
Functions
load_data | Load features and target from file.
number_of_features | Get the number of features of each specific dtype in a DataFrame.
split_data | Split data into train and test sets.
target_distribution | Get the target distributions.
Contents
evalml.preprocessing.utils.load_data(path, index, target, n_rows=None, drop=None, verbose=True, **kwargs)
Load features and target from file.
- Parameters
path (str) – Path to a file, or an HTTP, FTP, or S3 URL.
index (str) – Column for index.
target (str) – Column for target.
n_rows (int) – Number of rows to return. Defaults to None.
drop (list) – List of columns to drop. Defaults to None.
verbose (bool) – If True, prints information about features and target. Defaults to True.
**kwargs – Other keyword arguments to pass to the pandas read_csv function.
- Returns
Features matrix and target.
- Return type
pd.DataFrame, pd.Series
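The parameters above map closely onto a plain pandas workflow. The following is a minimal sketch of that workflow, not evalml's implementation: the helper name load_features_and_target is hypothetical, the CSV content is made up for illustration, and the verbose reporting is omitted.

```python
import io

import pandas as pd


def load_features_and_target(path, index, target, n_rows=None, drop=None, **kwargs):
    """Hypothetical helper mirroring the documented parameters of load_data."""
    # Read at most n_rows rows and use the `index` column as the row index.
    frame = pd.read_csv(path, index_col=index, nrows=n_rows, **kwargs)
    # Remove any columns the caller asked to drop.
    if drop:
        frame = frame.drop(columns=drop)
    # Separate the target column from the remaining feature columns.
    y = frame.pop(target)
    return frame, y


# Made-up CSV data; an in-memory buffer stands in for a file path or URL.
csv_text = "id,age,note,label\n1,34,a,0\n2,51,b,1\n3,27,c,0\n"
X, y = load_features_and_target(
    io.StringIO(csv_text), index="id", target="label", n_rows=2, drop=["note"]
)
```

Here X keeps only the age column for the first two rows, and y holds the matching label values.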
evalml.preprocessing.utils.number_of_features(dtypes)
Get the number of features of each specific dtype in a DataFrame.
- Parameters
dtypes (pd.Series) – DataFrame.dtypes to get the number of features for.
- Returns
dtypes and the number of features for each input type.
- Return type
pd.Series
Example
>>> X = pd.DataFrame()
>>> X["integers"] = [i for i in range(10)]
>>> X["floats"] = [float(i) for i in range(10)]
>>> X["strings"] = [str(i) for i in range(10)]
>>> X["booleans"] = [bool(i%2) for i in range(10)]
Lists the number of columns corresponding to each dtype.
>>> number_of_features(X.dtypes)
             Number of Features
Boolean                       1
Categorical                   1
Numeric                       2
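The grouping in the example above can be approximated with pandas alone. This is a rough sketch under assumptions, not evalml's implementation: the helper count_features_by_kind and its three-way mapping (Boolean, Numeric, everything else as Categorical) are hypothetical and chosen only to reproduce the labels shown in the example.

```python
import pandas as pd
from pandas.api import types as ptypes


def count_features_by_kind(dtypes):
    """Hypothetical helper: count columns per coarse dtype group."""

    def kind(dtype):
        # Check booleans first, because pandas also treats bool as numeric.
        if ptypes.is_bool_dtype(dtype):
            return "Boolean"
        if ptypes.is_numeric_dtype(dtype):
            return "Numeric"
        # Assumption: every other dtype is reported as categorical.
        return "Categorical"

    return dtypes.map(kind).value_counts().sort_index()


X = pd.DataFrame()
X["integers"] = [i for i in range(10)]
X["floats"] = [float(i) for i in range(10)]
X["strings"] = [str(i) for i in range(10)]
X["booleans"] = [bool(i % 2) for i in range(10)]

counts = count_features_by_kind(X.dtypes)
```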
evalml.preprocessing.utils.split_data(X, y, problem_type, problem_configuration=None, test_size=0.2, random_seed=0)
Split data into train and test sets.
- Parameters
X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series or np.ndarray) – Target data of length [n_samples].
problem_type (str or ProblemTypes) – Type of supervised learning problem. See evalml.problem_types.problemtype.all_problem_types for a full list.
problem_configuration (dict) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables.
test_size (float) – Fraction of data points to include in the test set. Defaults to 0.2 (20%).
random_seed (int) – Seed for the random number generator. Defaults to 0.
- Returns
Feature and target data each split into train and test sets.
- Return type
pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
Examples
>>> X = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["First"])
>>> y = pd.Series([8, 9, 10, 11, 12, 13])
...
>>> X_train, X_validation, y_train, y_validation = split_data(X, y, "regression", random_seed=42)
>>> X_train
   First
5      6
2      3
4      5
3      4
>>> X_validation
   First
0      1
1      2
>>> y_train
5    13
2    10
4    12
3    11
dtype: int64
>>> y_validation
0     8
1     9
dtype: int64
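To make the roles of test_size and random_seed concrete, here is a minimal sketch of a seeded shuffle split written with NumPy and pandas. It is an illustration only, not evalml's implementation: the helper shuffle_split is hypothetical, it ignores problem_type and problem_configuration, and its row assignment will generally differ from the output shown above.

```python
import numpy as np
import pandas as pd


def shuffle_split(X, y, test_size=0.2, random_seed=0):
    """Hypothetical helper: seeded random split into train and test sets."""
    # A seeded generator makes the shuffle reproducible.
    rng = np.random.default_rng(random_seed)
    order = rng.permutation(len(X))
    # Assumption: round the test set size up, so 20% of 6 rows gives 2 rows.
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


X = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["First"])
y = pd.Series([8, 9, 10, 11, 12, 13])

X_train, X_test, y_train, y_test = shuffle_split(X, y, test_size=0.2, random_seed=42)
```

Every row lands in exactly one of the two sets, and features and targets stay aligned because both are selected with the same positional indices. An unordered shuffle like this is not appropriate for time series data, where row order matters.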
evalml.preprocessing.utils.target_distribution(targets)
Get the target distributions.
- Parameters
targets (pd.Series) – Target data.
- Returns
Target data and their frequency distribution as percentages.
- Return type
pd.Series
Examples
>>> y = pd.Series([1, 2, 4, 1, 3, 3, 1, 2])
>>> target_distribution(y)
Targets
1    37.50%
2    25.00%
3    25.00%
4    12.50%
dtype: object
>>> y = pd.Series([True, False, False, False, True])
>>> target_distribution(y)
Targets
False    60.00%
True     40.00%
dtype: object
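The percentages in the examples above are relative frequencies formatted as strings. A minimal pandas sketch of that calculation follows; the helper distribution_as_percent is hypothetical and may differ in detail from evalml's implementation (for example, in how the index is named).

```python
import pandas as pd


def distribution_as_percent(targets):
    """Hypothetical helper: share of each target value as a percentage string."""
    # normalize=True returns fractions that sum to 1 instead of raw counts.
    shares = targets.value_counts(normalize=True).sort_index()
    # Format each fraction with two decimals, e.g. 0.375 becomes "37.50%".
    return shares.map("{:.2%}".format)


y = pd.Series([1, 2, 4, 1, 3, 3, 1, 2])
dist = distribution_as_percent(y)

y_bool = pd.Series([True, False, False, False, True])
dist_bool = distribution_as_percent(y_bool)
```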