utils

Utility methods for EvalML pipelines.

Module Contents

Functions

generate_pipeline_code

Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.

generate_pipeline_example

Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.

get_actions_from_option_defaults

Returns a list of actions based on the default parameters of each option in the input DataCheckActionOption list.

make_pipeline

Given input data, target data, an estimator class, and the problem type, generates a pipeline instance with a preprocessing chain recommended for those inputs. The pipeline will be an instance of a subclass of the appropriate pipeline base class for the specified problem_type.

make_pipeline_from_actions

Creates a pipeline of components to address the input DataCheckAction list.

make_pipeline_from_data_check_output

Creates a pipeline of components to address warnings and errors output from running data checks. Uses all default suggestions.

make_timeseries_baseline_pipeline

Makes a baseline pipeline for time series problems (regression, binary, or multiclass).

rows_of_interest

Get the row indices of the data that are closest to the threshold. Works only for binary classification problems and pipelines.

Attributes Summary

DECOMPOSER_PERIOD_CAP

Contents

evalml.pipelines.utils.DECOMPOSER_PERIOD_CAP = 1000

evalml.pipelines.utils.generate_pipeline_code(element, features_path=None)

Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.

Parameters
  • element (pipeline instance) – The pipeline instance to generate Python code for.

  • features_path (str) – Path to the features JSON file created by featuretools.save_features(). Defaults to None.

Returns

String representation of Python code that can be run separately in order to recreate the pipeline instance. Does not include code for custom component implementation.

Return type

str

Raises
  • ValueError – If element is not a pipeline, or if the pipeline is nonlinear.

  • ValueError – If features in features_path do not match the features on the pipeline.
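
To illustrate what "code that can be run separately in order to recreate the instance" means, here is a toy round trip. It is only a sketch: a standard-library class (datetime.timedelta) stands in for a pipeline, and the helper generate_code is hypothetical. The string that EvalML actually generates will look different.

```python
import datetime


def generate_code(module, class_name, parameters, variable="obj"):
    """Return source code that imports a class and rebuilds an instance
    from its keyword arguments (hypothetical helper, not part of EvalML)."""
    return "\n".join(
        [
            f"from {module} import {class_name}",
            f"{variable} = {class_name}(**{parameters!r})",
        ]
    )


# Generate the code string for a stand-in object.
code = generate_code("datetime", "timedelta", {"days": 2, "hours": 3})
print(code)

# Running the generated string in a fresh namespace recreates the object.
namespace = {}
exec(code, namespace)
print(namespace["obj"] == datetime.timedelta(days=2, hours=3))
```

As with the real function, the generated string only contains imports and a constructor call, so any custom class it refers to must be importable wherever the string is executed.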

evalml.pipelines.utils.generate_pipeline_example(pipeline, path_to_train, path_to_holdout, target, path_to_features=None, path_to_mapping='', output_file_path=None)

Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.

Parameters
  • pipeline (pipeline instance) – The pipeline instance to generate Python code for.

  • path_to_train (str) – Path to the training data.

  • path_to_holdout (str) – Path to the holdout data.

  • target (str) – Name of the target variable.

  • path_to_features (str) – Path to the features JSON file. Defaults to None.

  • path_to_mapping (str) – Path to the mapping JSON file. Defaults to an empty string.

  • output_file_path (str) – Path to the output Python file. Defaults to None.

Returns

String representation of Python code that can be run separately in order to recreate the pipeline instance. Does not include code for custom component implementation.

Return type

str

evalml.pipelines.utils.get_actions_from_option_defaults(action_options)

Returns a list of actions based on the default parameters of each option in the input DataCheckActionOption list.

Parameters

action_options (list[DataCheckActionOption]) – List of DataCheckActionOption objects.

Returns

List of actions based on the default parameters of each option in the input list.

Return type

list[DataCheckAction]
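
The mapping from options to actions can be pictured with simplified stand-ins. This is a minimal sketch, not EvalML's implementation: the ActionOption and Action dataclasses, the action codes, and the "default_value" key are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActionOption:
    """Stand-in for an option: an action code plus parameter specifications."""
    action_code: str
    parameters: dict = field(default_factory=dict)


@dataclass
class Action:
    """Stand-in for a concrete action with resolved parameter values."""
    action_code: str
    parameters: dict


def actions_from_option_defaults(options):
    """Build one action per option, using each parameter's default value."""
    actions = []
    for option in options:
        defaults = {
            name: spec["default_value"]  # assumed key name for the default
            for name, spec in option.parameters.items()
        }
        actions.append(Action(option.action_code, defaults))
    return actions


options = [
    ActionOption("IMPUTE_COL", {"impute_strategy": {"default_value": "mean"}}),
    ActionOption("DROP_COL"),  # an option with no parameters
]
actions = actions_from_option_defaults(options)
print(actions[0].parameters)  # prints {'impute_strategy': 'mean'}
```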

evalml.pipelines.utils.make_pipeline(X, y, estimator, problem_type, parameters=None, sampler_name=None, extra_components_before=None, extra_components_after=None, use_estimator=True, known_in_advance=None, features=False, exclude_featurizers=None, include_decomposer=True)

Given input data, target data, an estimator class, and the problem type, generates a pipeline instance with a preprocessing chain recommended for those inputs. The pipeline will be an instance of a subclass of the appropriate pipeline base class for the specified problem_type.

Parameters
  • X (pd.DataFrame) – The input data of shape [n_samples, n_features].

  • y (pd.Series) – The target data of length [n_samples].

  • estimator (Estimator) – Estimator for pipeline.

  • problem_type (ProblemTypes or str) – Problem type for pipeline to generate.

  • parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters.

  • sampler_name (str) – The name of the sampler component to add to the pipeline. Only used in classification problems. Defaults to None.

  • extra_components_before (list[ComponentBase]) – List of extra components to be added before preprocessing components. Defaults to None.

  • extra_components_after (list[ComponentBase]) – List of extra components to be added after preprocessing components. Defaults to None.

  • use_estimator (bool) – Whether to add the provided estimator to the pipeline or not. Defaults to True.

  • known_in_advance (list[str], None) – List of features that are known in advance. Defaults to None.

  • features (bool) – Whether to add a DFSTransformer component to this pipeline. Defaults to False.

  • exclude_featurizers (list[str]) – A list of featurizer components to exclude from the pipeline. Valid options are “DatetimeFeaturizer”, “EmailFeaturizer”, “URLFeaturizer”, “NaturalLanguageFeaturizer”, and “TimeSeriesFeaturizer”. Defaults to None.

  • include_decomposer (bool) – For time series regression problems, whether or not to include a decomposer in the generated pipeline. Defaults to True.

Returns

PipelineBase instance with dynamically generated preprocessing components and specified estimator.

Return type

PipelineBase object

Raises

ValueError – If estimator is not valid for the given problem type, or sampling is not supported for the given problem type.
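
The shape of the parameters argument (component names as keys, per-component parameter dictionaries as values) can be sketched as follows. The resolve_parameters helper and the default values shown are assumptions for illustration; EvalML resolves defaults inside each component.

```python
# Illustrative defaults only; real components define their own default values.
DEFAULTS = {
    "Imputer": {"numeric_impute_strategy": "mean"},
    "Random Forest Classifier": {"n_estimators": 100, "max_depth": 6},
}


def resolve_parameters(defaults, parameters=None):
    """Overlay user-supplied component parameters on top of the defaults.

    An empty dictionary or None leaves every default untouched.
    """
    overrides = parameters or {}
    return {
        component: {**component_defaults, **overrides.get(component, {})}
        for component, component_defaults in defaults.items()
    }


# Override a single parameter of a single component.
custom = {"Random Forest Classifier": {"n_estimators": 250}}
resolved = resolve_parameters(DEFAULTS, custom)
print(resolved["Random Forest Classifier"])
# prints {'n_estimators': 250, 'max_depth': 6}
```

Only the components and parameters named in the dictionary are overridden; everything else keeps its default.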

evalml.pipelines.utils.make_pipeline_from_actions(problem_type, actions, problem_configuration=None)

Creates a pipeline of components to address the input DataCheckAction list.

Parameters
  • problem_type (str or ProblemTypes) – The problem type that the pipeline should address.

  • actions (list[DataCheckAction]) – List of DataCheckAction objects used to create the list of components.

  • problem_configuration (dict) – Required for time series problem types. Values should be passed in for time_index, gap, forecast_horizon, and max_delay.

Returns

Pipeline which can be used to address data check actions.

Return type

PipelineBase

evalml.pipelines.utils.make_pipeline_from_data_check_output(problem_type, data_check_output, problem_configuration=None)

Creates a pipeline of components to address warnings and errors output from running data checks. Uses all default suggestions.

Parameters
  • problem_type (str or ProblemTypes) – The problem type.

  • data_check_output (dict) – Output from calling DataCheck.validate().

  • problem_configuration (dict) – Required for time series problem types. Values should be passed in for time_index, gap, forecast_horizon, and max_delay.

Returns

Pipeline which can be used to address data check outputs.

Return type

PipelineBase

Raises

ValueError – If problem_type is of type time series but an incorrect problem_configuration has been passed.
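
A problem_configuration for a time series problem needs the four keys listed above. The check below is a minimal sketch of that requirement, assuming problem types are given as strings such as "time series regression"; the helper names are hypothetical and EvalML's own validation may differ.

```python
REQUIRED_KEYS = ("time_index", "gap", "forecast_horizon", "max_delay")


def missing_configuration_keys(problem_type, problem_configuration=None):
    """Return the required keys absent from problem_configuration.

    Non-time-series problem types need no configuration, so nothing is missing.
    This sketch only handles problem types passed as strings.
    """
    if not str(problem_type).lower().startswith("time series"):
        return []
    provided = problem_configuration or {}
    return [key for key in REQUIRED_KEYS if key not in provided]


def check_problem_configuration(problem_type, problem_configuration=None):
    """Raise ValueError when a time series configuration is incomplete."""
    missing = missing_configuration_keys(problem_type, problem_configuration)
    if missing:
        raise ValueError(f"problem_configuration is missing: {', '.join(missing)}")


config = {"time_index": "date", "gap": 0, "forecast_horizon": 3, "max_delay": 5}
check_problem_configuration("time series regression", config)  # complete: no error
check_problem_configuration("binary", None)  # not time series: no error

print(missing_configuration_keys("time series binary", {"time_index": "date", "gap": 1}))
# prints ['forecast_horizon', 'max_delay']
```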

evalml.pipelines.utils.make_timeseries_baseline_pipeline(problem_type, gap, forecast_horizon, time_index, exclude_featurizer=False)

Makes a baseline pipeline for time series problems (regression, binary, or multiclass).

Parameters
  • problem_type – One of TIME_SERIES_REGRESSION, TIME_SERIES_MULTICLASS, or TIME_SERIES_BINARY.

  • gap (int) – Non-negative gap parameter.

  • forecast_horizon (int) – Positive forecast_horizon parameter.

  • time_index (str) – Column name of time_index parameter.

  • exclude_featurizer (bool) – Whether or not to exclude the TimeSeriesFeaturizer from the baseline graph. Defaults to False.

Returns

TimeSeriesPipelineBase, a time series pipeline corresponding to the problem type.

evalml.pipelines.utils.rows_of_interest(pipeline, X, y=None, threshold=None, epsilon=0.1, sort_values=True, types='all')

Get the row indices of the data that are closest to the threshold. Works only for binary classification problems and pipelines.

Parameters
  • pipeline (PipelineBase) – The fitted binary pipeline.

  • X (ww.DataTable, pd.DataFrame) – The input features to predict on.

  • y (ww.DataColumn, pd.Series, None) – The input target data, if available. Defaults to None.

  • threshold (float) – The threshold value of interest to separate positive and negative predictions. If None, uses the pipeline threshold if set, else 0.5. Defaults to None.

  • epsilon (float) – The maximum difference between a row's predicted probability and the threshold for the row to count as interesting. For instance, epsilon=0.1 and threshold=0.5 means that all rows with probabilities in [0.4, 0.6] are of interest. Defaults to 0.1.

  • sort_values (bool) – Whether to return the indices sorted by the distance from the threshold, such that the first values are closer to the threshold and the later values are further. Defaults to True.

  • types (str) –

    The type of rows to keep and return. Can be one of [‘incorrect’, ‘correct’, ‘true_positive’, ‘true_negative’, ‘all’]. Defaults to ‘all’.

    ‘incorrect’ - return only the rows where the predictions are incorrect. This means that, given the threshold and target y, keep only the rows which are labeled wrong.

    ‘correct’ - return only the rows where the predictions are correct. This means that, given the threshold and target y, keep only the rows which are correctly labeled.

    ‘true_positive’ - return only the rows which are positive, as given by the targets.

    ‘true_negative’ - return only the rows which are negative, as given by the targets.

    ‘all’ - return all rows. This is the only option available when there is no target data provided.

Returns

The indices corresponding to the rows of interest.

Raises
  • ValueError – If pipeline is not a fitted Binary Classification pipeline.

  • ValueError – If types is invalid or y is not provided when types is not ‘all’.

  • ValueError – If the threshold is provided and lies outside the range [0, 1].
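
The selection rule described above can be sketched in plain Python, operating directly on predicted probabilities rather than on a fitted pipeline. This is an illustration under stated assumptions, not EvalML's implementation: the function name rows_near_threshold is hypothetical, and treating a probability strictly greater than the threshold as a positive prediction is an assumption.

```python
def rows_near_threshold(probabilities, y=None, threshold=0.5, epsilon=0.1,
                        sort_values=True, types="all"):
    """Return indices of rows whose probability is within epsilon of threshold."""
    valid_types = {"incorrect", "correct", "true_positive", "true_negative", "all"}
    if types not in valid_types:
        raise ValueError(f"Invalid types: {types!r}")
    if y is None and types != "all":
        raise ValueError("y is required unless types is 'all'")
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must lie in [0, 1]")

    selected = []
    for index, probability in enumerate(probabilities):
        distance = abs(probability - threshold)
        if distance > epsilon:
            continue  # too far from the threshold to be of interest
        if types != "all":
            actual = bool(y[index])
            predicted = probability > threshold  # assumption: strict comparison
            keep = {
                "correct": predicted == actual,
                "incorrect": predicted != actual,
                "true_positive": actual,       # positive according to the target
                "true_negative": not actual,   # negative according to the target
            }[types]
            if not keep:
                continue
        selected.append((distance, index))

    if sort_values:
        selected.sort()  # closest to the threshold first
    return [index for _, index in selected]


probabilities = [0.95, 0.52, 0.41, 0.10, 0.58]
targets = [1, 0, 0, 0, 1]

print(rows_near_threshold(probabilities))  # prints [1, 4, 2]
print(rows_near_threshold(probabilities, sort_values=False))  # prints [1, 2, 4]
print(rows_near_threshold(probabilities, y=targets, types="incorrect"))  # prints [1]
```

Rows 0 and 3 are dropped because their probabilities are more than 0.1 away from 0.5; the remaining rows are ordered by distance from the threshold when sort_values is True.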