utils
================================

.. py:module:: evalml.pipelines.utils

.. autoapi-nested-parse::

   Utility methods for EvalML pipelines.


Module Contents
---------------

Functions
~~~~~~~~~

.. autoapisummary::
   :nosignatures:

   evalml.pipelines.utils.generate_pipeline_code
   evalml.pipelines.utils.generate_pipeline_example
   evalml.pipelines.utils.get_actions_from_option_defaults
   evalml.pipelines.utils.make_pipeline
   evalml.pipelines.utils.make_pipeline_from_actions
   evalml.pipelines.utils.make_pipeline_from_data_check_output
   evalml.pipelines.utils.make_timeseries_baseline_pipeline
   evalml.pipelines.utils.rows_of_interest
   evalml.pipelines.utils.stack_data
   evalml.pipelines.utils.stack_X
   evalml.pipelines.utils.unstack_multiseries

Attributes Summary
~~~~~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.pipelines.utils.DECOMPOSER_PERIOD_CAP
   evalml.pipelines.utils.MULTISERIES_SEPARATOR_SYMBOL


Contents
~~~~~~~~~~~~~~~~~~~

.. py:data:: DECOMPOSER_PERIOD_CAP
   :annotation: = 1000


.. py:function:: generate_pipeline_code(element, features_path=None)

   Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.

   :param element: The pipeline instance to generate Python code for.
   :type element: pipeline instance
   :param features_path: Path to the features JSON file created by featuretools.save_features(). Defaults to None.
   :type features_path: str

   :returns: String representation of Python code that can be run separately in order to recreate the pipeline instance. Does not include code for custom component implementation.
   :rtype: str

   :raises ValueError: If element is not a pipeline, or if the pipeline is nonlinear.
   :raises ValueError: If features in `features_path` do not match the features on the pipeline.


.. py:function:: generate_pipeline_example(pipeline, path_to_train, path_to_holdout, target, path_to_features=None, path_to_mapping='', output_file_path=None)

   Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.

   :param pipeline: The pipeline instance to generate Python code for.
   :type pipeline: pipeline instance
   :param path_to_train: Path to the training data.
   :type path_to_train: str
   :param path_to_holdout: Path to the holdout data.
   :type path_to_holdout: str
   :param target: Name of the target variable.
   :type target: str
   :param path_to_features: Path to the features JSON file. Defaults to None.
   :type path_to_features: str
   :param path_to_mapping: Path to the mapping JSON file. Defaults to an empty string.
   :type path_to_mapping: str
   :param output_file_path: Path to the output Python file. Defaults to None.
   :type output_file_path: str

   :returns: String representation of Python code that can be run separately in order to recreate the pipeline instance. Does not include code for custom component implementation.
   :rtype: str


.. py:function:: get_actions_from_option_defaults(action_options)

   Returns a list of actions based on the default parameters of each option in the input DataCheckActionOption list.

   :param action_options: List of DataCheckActionOption objects.
   :type action_options: list[DataCheckActionOption]

   :returns: List of actions based on the default parameters of each option in the input list.
   :rtype: list[DataCheckAction]


.. py:function:: make_pipeline(X, y, estimator, problem_type, parameters=None, sampler_name=None, extra_components_before=None, extra_components_after=None, use_estimator=True, known_in_advance=None, features=False, exclude_featurizers=None, include_decomposer=True)

   Given input data, target data, an estimator class and the problem type, generates a pipeline with a preprocessing chain recommended based on the inputs. The pipeline will be a subclass of the appropriate pipeline base class for the specified problem_type.

   :param X: The input data of shape [n_samples, n_features].
   :type X: pd.DataFrame
   :param y: The target data of length [n_samples].
   :type y: pd.Series
   :param estimator: Estimator for the pipeline.
   :type estimator: Estimator
   :param problem_type: Problem type for the pipeline to generate.
   :type problem_type: ProblemTypes or str
   :param parameters: Dictionary with component names as keys and dictionary of that component's parameters as values. An empty dictionary or None implies using all default values for component parameters.
   :type parameters: dict
   :param sampler_name: The name of the sampler component to add to the pipeline. Only used in classification problems. Defaults to None.
   :type sampler_name: str
   :param extra_components_before: List of extra components to be added before preprocessing components. Defaults to None.
   :type extra_components_before: list[ComponentBase]
   :param extra_components_after: List of extra components to be added after preprocessing components. Defaults to None.
   :type extra_components_after: list[ComponentBase]
   :param use_estimator: Whether to add the provided estimator to the pipeline or not. Defaults to True.
   :type use_estimator: bool
   :param known_in_advance: List of features that are known in advance. Defaults to None.
   :type known_in_advance: list[str], None
   :param features: Whether to add a DFSTransformer component to this pipeline. Defaults to False.
   :type features: bool
   :param exclude_featurizers: A list of featurizer components to exclude from the pipeline. Valid options are "DatetimeFeaturizer", "EmailFeaturizer", "URLFeaturizer", "NaturalLanguageFeaturizer", "TimeSeriesFeaturizer". Defaults to None.
   :type exclude_featurizers: list[str]
   :param include_decomposer: For time series regression problems, whether or not to include a decomposer in the generated pipeline. Defaults to True.
   :type include_decomposer: bool

   :returns: PipelineBase instance with dynamically generated preprocessing components and specified estimator.
   :rtype: PipelineBase object

   :raises ValueError: If estimator is not valid for the given problem type, or sampling is not supported for the given problem type.


.. py:function:: make_pipeline_from_actions(problem_type, actions, problem_configuration=None)

   Creates a pipeline of components to address the input DataCheckAction list.

   :param problem_type: The problem type that the pipeline should address.
   :type problem_type: str or ProblemType
   :param actions: List of DataCheckAction objects used to create the list of components.
   :type actions: list[DataCheckAction]
   :param problem_configuration: Required for time series problem types. Values should be passed in for time_index, gap, forecast_horizon, and max_delay.
   :type problem_configuration: dict

   :returns: Pipeline which can be used to address data check actions.
   :rtype: PipelineBase


.. py:function:: make_pipeline_from_data_check_output(problem_type, data_check_output, problem_configuration=None)

   Creates a pipeline of components to address warnings and errors output from running data checks. Uses all default suggestions.

   :param problem_type: The problem type.
   :type problem_type: str or ProblemType
   :param data_check_output: Output from calling ``DataCheck.validate()``.
   :type data_check_output: dict
   :param problem_configuration: Required for time series problem types. Values should be passed in for time_index, gap, forecast_horizon, and max_delay.
   :type problem_configuration: dict

   :returns: Pipeline which can be used to address data check outputs.
   :rtype: PipelineBase

   :raises ValueError: If problem_type is of type time series but an incorrect problem_configuration has been passed.


.. py:function:: make_timeseries_baseline_pipeline(problem_type, gap, forecast_horizon, time_index, exclude_featurizer=False, series_id=None)

   Makes a baseline pipeline for time series problems.

   :param problem_type: One of TIME_SERIES_REGRESSION, TIME_SERIES_MULTICLASS, TIME_SERIES_BINARY.
   :param gap: Non-negative gap parameter.
   :type gap: int
   :param forecast_horizon: Positive forecast_horizon parameter.
   :type forecast_horizon: int
   :param time_index: Column name of time_index parameter.
   :type time_index: str
   :param exclude_featurizer: Whether or not to exclude the TimeSeriesFeaturizer from the baseline graph. Defaults to False.
   :type exclude_featurizer: bool
   :param series_id: Column name of series_id parameter. Only used for multiseries time series. Defaults to None.
   :type series_id: str

   :returns: A time series pipeline corresponding to the problem type.
   :rtype: TimeSeriesPipelineBase


.. py:data:: MULTISERIES_SEPARATOR_SYMBOL
   :annotation: = |


.. py:function:: rows_of_interest(pipeline, X, y=None, threshold=None, epsilon=0.1, sort_values=True, types='all')

   Get the row indices of the data that are closest to the threshold. Works only for binary classification problems and pipelines.

   :param pipeline: The fitted binary pipeline.
   :type pipeline: PipelineBase
   :param X: The input features to predict on.
   :type X: ww.DataTable, pd.DataFrame
   :param y: The input target data, if available. Defaults to None.
   :type y: ww.DataColumn, pd.Series, None
   :param threshold: The threshold value of interest to separate positive and negative predictions. If None, uses the pipeline threshold if set, else 0.5. Defaults to None.
   :type threshold: float
   :param epsilon: The maximum difference between the predicted probability and the threshold for a row to be considered of interest. For instance, epsilon=0.1 and threshold=0.5 would mean we consider all rows in [0.4, 0.6] to be of interest. Defaults to 0.1.
   :type epsilon: float
   :param sort_values: Whether to return the indices sorted by the distance from the threshold, such that the first values are closer to the threshold and the later values are further. Defaults to True.
   :type sort_values: bool
   :param types: The type of rows to keep and return. Can be one of ['incorrect', 'correct', 'true_positive', 'true_negative', 'all']. Defaults to 'all'.

      - 'incorrect': return only the rows where the predictions are incorrect. This means that, given the threshold and target y, keep only the rows which are labeled wrong.
      - 'correct': return only the rows where the predictions are correct. This means that, given the threshold and target y, keep only the rows which are correctly labeled.
      - 'true_positive': return only the rows which are positive, as given by the targets.
      - 'true_negative': return only the rows which are negative, as given by the targets.
      - 'all': return all rows. This is the only option available when there is no target data provided.

   :type types: str

   :returns: The indices corresponding to the rows of interest.

   :raises ValueError: If pipeline is not a fitted Binary Classification pipeline.
   :raises ValueError: If types is invalid or y is not provided when types is not 'all'.
   :raises ValueError: If the threshold is provided and is outside the range [0, 1].


.. py:function:: stack_data(data, include_series_id=False, series_id_name=None, starting_index=None)

   Stacks the given DataFrame back into a single Series, or a DataFrame if include_series_id is True.

   Should only be used for data that is expected to be a single series. To stack multiple unstacked columns, use `stack_X`.

   :param data: The data to stack.
   :type data: pd.DataFrame
   :param include_series_id: Whether or not to extract the series id and include it in a separate column. Defaults to False.
   :type include_series_id: bool
   :param series_id_name: If include_series_id is True, the series_id name to set for the column. The column will be named 'series_id' if this parameter is None.
   :type series_id_name: str
   :param starting_index: The starting index to use for the stacked series. If None and the input index is numeric, the starting index will match that of the input data. If None and the input index is a DatetimeIndex, the index will be the input data's index repeated over the number of columns in the input data.
   :type starting_index: int

   :returns: The data in stacked series form.
   :rtype: pd.Series or pd.DataFrame


.. py:function:: stack_X(X, series_id_name, time_index, starting_index=None, series_id_values=None)

   Restacks the unstacked features into a single DataFrame.

   :param X: The unstacked features.
   :type X: pd.DataFrame
   :param series_id_name: The name of the series id column.
   :type series_id_name: str
   :param time_index: The name of the time index column.
   :type time_index: str
   :param starting_index: The starting index to use for the stacked DataFrame. If None, the starting index will match that of the input data. Defaults to None.
   :type starting_index: int
   :param series_id_values: The unique values of a series ID, used to generate the index. If None, values will be generated from X column values. Required if X only has time index values and no exogenous values. Defaults to None.
   :type series_id_values: list

   :returns: The restacked features.
   :rtype: pd.DataFrame


.. py:function:: unstack_multiseries(X, y, series_id, time_index, target_name)

   Converts multiseries data with one series_id column and one target column to one target column per series id.

   Datetime information will be preserved only as a column in X.

   :param X: Data of shape [n_samples, n_features].
   :type X: pd.DataFrame
   :param y: Target data.
   :type y: pd.Series
   :param series_id: The column which identifies which series each row belongs to.
   :type series_id: str
   :param time_index: Specifies the name of the column in X that provides the datetime objects.
   :type time_index: str
   :param target_name: The name of the target column.
   :type target_name: str

   :returns: The unstacked X and y data.
   :rtype: pd.DataFrame, pd.DataFrame
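
The selection rule documented for ``rows_of_interest`` (keep rows whose predicted probability lies within ``epsilon`` of ``threshold``, optionally sorted by distance from the threshold) can be illustrated with a minimal pure-Python sketch. This is not EvalML's implementation: the helper name ``rows_near_threshold`` and the plain list of probabilities are assumptions made for illustration, and the ``types`` filtering against target data is omitted.

```python
def rows_near_threshold(probabilities, threshold=0.5, epsilon=0.1, sort_values=True):
    """Hypothetical helper: indices of rows within epsilon of the threshold.

    Mirrors only the documented threshold/epsilon/sort_values behavior;
    it is an illustration, not the library's code.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must lie within [0, 1]")

    # Collect (distance, index) pairs for rows inside the band
    # [threshold - epsilon, threshold + epsilon].
    candidates = [
        (abs(p - threshold), i)
        for i, p in enumerate(probabilities)
        if abs(p - threshold) <= epsilon
    ]

    if sort_values:
        # Closest rows first; list.sort is stable, so ties keep input order.
        candidates.sort(key=lambda pair: pair[0])

    return [i for _, i in candidates]


# Assumed positive-class probabilities for five rows.
probs = [0.45, 0.90, 0.52, 0.10, 0.58]
print(rows_near_threshold(probs))                     # [2, 0, 4]
print(rows_near_threshold(probs, sort_values=False))  # [0, 2, 4]
```

With the default ``threshold=0.5`` and ``epsilon=0.1``, rows 1 and 3 fall outside [0.4, 0.6] and are dropped, and the remaining rows are ordered by how close their probability is to 0.5.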