utils
====================================

.. py:module:: evalml.preprocessing.utils

.. autoapi-nested-parse::

   Helpful preprocessing utilities.



Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::
   :nosignatures:

   evalml.preprocessing.utils.load_data
   evalml.preprocessing.utils.number_of_features
   evalml.preprocessing.utils.split_data
   evalml.preprocessing.utils.split_multiseries_data
   evalml.preprocessing.utils.target_distribution


Contents
~~~~~~~~~~~~~~~~~~~

.. py:function:: load_data(path, index, target, n_rows=None, drop=None, verbose=True, **kwargs)

   Load features and target from file.

   :param path: Path to file or a http/ftp/s3 URL.
   :type path: str
   :param index: Column for index.
   :type index: str
   :param target: Column for target.
   :type target: str
   :param n_rows: Number of rows to return. Defaults to None.
   :type n_rows: int
   :param drop: List of columns to drop. Defaults to None.
   :type drop: list
   :param verbose: If True, prints information about features and target. Defaults to True.
   :type verbose: bool
   :param \*\*kwargs: Other keyword arguments that should be passed to pandas' `read_csv` method.

   :returns: Features matrix and target.
   :rtype: pd.DataFrame, pd.Series


.. py:function:: number_of_features(dtypes)

   Get the number of features of each specific dtype in a DataFrame.

   :param dtypes: DataFrame.dtypes to get the number of features for.
   :type dtypes: pd.Series

   :returns: dtypes and the number of features for each input type.
   :rtype: pd.Series

   .. rubric:: Example

   >>> X = pd.DataFrame()
   >>> X["integers"] = [i for i in range(10)]
   >>> X["floats"] = [float(i) for i in range(10)]
   >>> X["strings"] = [str(i) for i in range(10)]
   >>> X["booleans"] = [bool(i%2) for i in range(10)]

   Lists the number of columns corresponding to each dtype.

   >>> number_of_features(X.dtypes)
                Number of Features
   Boolean                       1
   Categorical                   1
   Numeric                       2


.. py:function:: split_data(X, y, problem_type, problem_configuration=None, test_size=None, random_seed=0)

   Split data into train and test sets.

   :param X: Data of shape [n_samples, n_features].
   :type X: pd.DataFrame or np.ndarray
   :param y: Target data of length [n_samples].
   :type y: pd.Series or np.ndarray
   :param problem_type: Type of supervised learning problem. See evalml.problem_types.problemtype.all_problem_types for a full list.
   :type problem_type: str or ProblemTypes
   :param problem_configuration: Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables.
   :type problem_configuration: dict
   :param test_size: What percentage of data points should be included in the test set. Defaults to 0.2 (20%) for non-timeseries problems and 0.1 (10%) for timeseries problems.
   :type test_size: float
   :param random_seed: Seed for the random number generator. Defaults to 0.
   :type random_seed: int

   :returns: Feature and target data each split into train and test sets.
   :rtype: pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
   :raises ValueError: If the problem_configuration is missing or does not contain both a time_index and series_id for multiseries problems.

   .. rubric:: Examples

   >>> X = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["First"])
   >>> y = pd.Series([8, 9, 10, 11, 12, 13])
   ...
   >>> X_train, X_validation, y_train, y_validation = split_data(X, y, "regression", random_seed=42)
   >>> X_train
      First
   5      6
   2      3
   4      5
   3      4
   >>> X_validation
      First
   0      1
   1      2
   >>> y_train
   5    13
   2    10
   4    12
   3    11
   dtype: int64
   >>> y_validation
   0    8
   1    9
   dtype: int64

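   For time series problems, ``problem_configuration`` carries the time index and delay
   parameters described above. The following is a minimal, illustrative sketch only: the
   column name ``date`` and the ``gap``/``max_delay`` values are placeholders, the resulting
   split is not shown, and (as with the other examples on this page) ``pd`` and ``split_data``
   are assumed to already be in scope.

   >>> X = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=10), "First": range(10)})
   >>> y = pd.Series(range(10))
   >>> problem_config = {"time_index": "date", "gap": 0, "max_delay": 1}
   >>> X_train, X_validation, y_train, y_validation = split_data(
   ...     X, y, "time series regression", problem_configuration=problem_config
   ... )
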
.. py:function:: split_multiseries_data(X, y, series_id, time_index, **kwargs)

   Split stacked multiseries data into train and test sets. Unstacked data can use `split_data`.

   :param X: The input training data of shape [n_samples*n_series, n_features].
   :type X: pd.DataFrame
   :param y: The target training data of length [n_samples*n_series].
   :type y: pd.Series
   :param series_id: Name of the column containing the series id.
   :type series_id: str
   :param time_index: Name of the column containing the time index.
   :type time_index: str
   :param \*\*kwargs: Additional keyword arguments to pass to the split_data function.

   :returns: Feature and target data each split into train and test sets.
   :rtype: pd.DataFrame, pd.DataFrame, pd.Series, pd.Series

   A usage sketch for this function appears at the end of this page.


.. py:function:: target_distribution(targets)

   Get the target distributions.

   :param targets: Target data.
   :type targets: pd.Series

   :returns: Target data and their frequency distribution as percentages.
   :rtype: pd.Series

   .. rubric:: Examples

   >>> y = pd.Series([1, 2, 4, 1, 3, 3, 1, 2])
   >>> print(target_distribution(y).to_string())
   Targets
   1    37.50%
   2    25.00%
   3    25.00%
   4    12.50%
   >>> y = pd.Series([True, False, False, False, True])
   >>> print(target_distribution(y).to_string())
   Targets
   False    60.00%
   True     40.00%

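The following is a minimal, illustrative usage sketch for ``split_multiseries_data``
(documented above). The column names ``date``, ``series_id``, and ``feature`` and the
two-series layout are placeholders, the resulting split is not shown, and (as with the
other examples on this page) ``pd`` and the function are assumed to already be in scope.

>>> X = pd.DataFrame(
...     {
...         "date": list(pd.date_range("2021-01-01", periods=10)) * 2,
...         "series_id": ["a"] * 10 + ["b"] * 10,
...         "feature": range(20),
...     }
... )
>>> y = pd.Series(range(20), name="target")
>>> X_train, X_holdout, y_train, y_holdout = split_multiseries_data(
...     X, y, series_id="series_id", time_index="date"
... )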