utils
====================================

.. py:module:: evalml.preprocessing.utils

.. autoapi-nested-parse::

   Helpful preprocessing utilities.



Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::
   :nosignatures:

   evalml.preprocessing.utils.load_data
   evalml.preprocessing.utils.number_of_features
   evalml.preprocessing.utils.split_data
   evalml.preprocessing.utils.split_multiseries_data
   evalml.preprocessing.utils.target_distribution


Contents
~~~~~~~~~~~~~~~~~~~

.. py:function:: load_data(path, index, target, n_rows=None, drop=None, verbose=True, **kwargs)

   Load features and target from file.

   :param path: Path to file or a http/ftp/s3 URL.
   :type path: str
   :param index: Column for index.
   :type index: str
   :param target: Column for target.
   :type target: str
   :param n_rows: Number of rows to return. Defaults to None.
   :type n_rows: int
   :param drop: List of columns to drop. Defaults to None.
   :type drop: list
   :param verbose: If True, prints information about features and target. Defaults to True.
   :type verbose: bool
   :param \*\*kwargs: Other keyword arguments that should be passed to pandas' `read_csv` method.

   :returns: Features matrix and target.
   :rtype: pd.DataFrame, pd.Series


.. py:function:: number_of_features(dtypes)

   Get the number of features of each specific dtype in a DataFrame.

   :param dtypes: DataFrame.dtypes to get the number of features for.
   :type dtypes: pd.Series

   :returns: dtypes and the number of features for each input type.
   :rtype: pd.Series

   .. rubric:: Example

   >>> X = pd.DataFrame()
   >>> X["integers"] = [i for i in range(10)]
   >>> X["floats"] = [float(i) for i in range(10)]
   >>> X["strings"] = [str(i) for i in range(10)]
   >>> X["booleans"] = [bool(i%2) for i in range(10)]

   Lists the number of columns corresponding to each dtype.

   >>> number_of_features(X.dtypes)
                Number of Features
   Boolean                       1
   Categorical                   1
   Numeric                       2


.. py:function:: split_data(X, y, problem_type, problem_configuration=None, test_size=None, random_seed=0)

   Split data into train and test sets.

   :param X: Data of shape [n_samples, n_features].
   :type X: pd.DataFrame or np.ndarray
   :param y: Target data of length [n_samples].
   :type y: pd.Series or np.ndarray
   :param problem_type: Type of supervised learning problem. See evalml.problem_types.problemtype.all_problem_types for a full list.
   :type problem_type: str or ProblemTypes
   :param problem_configuration: Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables.
   :type problem_configuration: dict
   :param test_size: What percentage of data points should be included in the test set. Defaults to 0.2 (20%) for non-timeseries problems and 0.1 (10%) for timeseries problems.
   :type test_size: float
   :param random_seed: Seed for the random number generator. Defaults to 0.
   :type random_seed: int

   :returns: Feature and target data each split into train and test sets.
   :rtype: pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
   :raises ValueError: If the problem_configuration is missing or does not contain both a time_index and series_id for multiseries problems.

   .. rubric:: Examples

   >>> X = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["First"])
   >>> y = pd.Series([8, 9, 10, 11, 12, 13])
   ...
   >>> X_train, X_validation, y_train, y_validation = split_data(X, y, "regression", random_seed=42)
   >>> X_train
      First
   5      6
   2      3
   4      5
   3      4
   >>> X_validation
      First
   0      1
   1      2
   >>> y_train
   5    13
   2    10
   4    12
   3    11
   dtype: int64
   >>> y_validation
   0    8
   1    9
   dtype: int64

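   For time series problems, ``problem_configuration`` carries the time index and delay
   parameters described above. The following is a minimal, illustrative sketch only: the
   column name ``date`` and the ``gap``/``max_delay`` values are placeholders, the resulting
   split is not shown, and (as with the other examples on this page) ``pd`` and ``split_data``
   are assumed to already be in scope.

   >>> X = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=10), "First": range(10)})
   >>> y = pd.Series(range(10))
   >>> problem_config = {"time_index": "date", "gap": 0, "max_delay": 1}
   >>> X_train, X_validation, y_train, y_validation = split_data(
   ...     X, y, "time series regression", problem_configuration=problem_config
   ... )
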
.. py:function:: split_multiseries_data(X, y, series_id, time_index, **kwargs)

   Split stacked multiseries data into train and test sets. Unstacked data can use `split_data`.

   :param X: The input training data of shape [n_samples*n_series, n_features].
   :type X: pd.DataFrame
   :param y: The target training data of length [n_samples*n_series].
   :type y: pd.Series
   :param series_id: Name of the column containing the series id.
   :type series_id: str
   :param time_index: Name of the column containing the time index.
   :type time_index: str
   :param \*\*kwargs: Additional keyword arguments to pass to the split_data function.

   :returns: Feature and target data each split into train and test sets.
   :rtype: pd.DataFrame, pd.DataFrame, pd.Series, pd.Series

   A usage sketch for this function appears at the end of this page.


.. py:function:: target_distribution(targets)

   Get the target distributions.

   :param targets: Target data.
   :type targets: pd.Series

   :returns: Target data and their frequency distribution as percentages.
   :rtype: pd.Series

   .. rubric:: Examples

   >>> y = pd.Series([1, 2, 4, 1, 3, 3, 1, 2])
   >>> print(target_distribution(y).to_string())
   Targets
   1    37.50%
   2    25.00%
   3    25.00%
   4    12.50%
   >>> y = pd.Series([True, False, False, False, True])
   >>> print(target_distribution(y).to_string())
   Targets
   False    60.00%
   True     40.00%

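The following is a minimal, illustrative usage sketch for ``split_multiseries_data``
(documented above). The column names ``date``, ``series_id``, and ``feature`` and the
two-series layout are placeholders, the resulting split is not shown, and (as with the
other examples on this page) ``pd`` and the function are assumed to already be in scope.

>>> X = pd.DataFrame(
...     {
...         "date": list(pd.date_range("2021-01-01", periods=10)) * 2,
...         "series_id": ["a"] * 10 + ["b"] * 10,
...         "feature": range(20),
...     }
... )
>>> y = pd.Series(range(20), name="target")
>>> X_train, X_holdout, y_train, y_holdout = split_multiseries_data(
...     X, y, series_id="series_id", time_index="date"
... )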