utils

Utilities useful in AutoML.

Module Contents

Functions

check_all_pipeline_names_unique

Checks whether all the pipeline names are unique.

get_best_sampler_for_data

Returns the name of the sampler component to use for AutoMLSearch.

get_default_primary_search_objective

Get the default primary search objective for a problem type.

get_pipelines_from_component_graphs

Returns created pipelines from passed component graphs based on the specified problem type.

make_data_splitter

Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.

tune_binary_threshold

Tunes the threshold of a binary pipeline to the X and y thresholding data.

Attributes Summary

AutoMLConfig

Contents

evalml.automl.utils.AutoMLConfig
evalml.automl.utils.check_all_pipeline_names_unique(pipelines)

Checks whether all the pipeline names are unique.

Parameters

pipelines (list[PipelineBase]) – List of pipelines to check if all names are unique.

Raises

ValueError – If any pipeline names are duplicated.
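
The check described above can be illustrated without EvalML installed. The following is a minimal sketch, not the library's implementation: `FakePipeline` and `check_names_unique` are hypothetical names, and the only assumption about a pipeline is that it exposes a `name` attribute.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakePipeline:
    """Hypothetical stand-in for a pipeline; only `name` matters here."""
    name: str


def check_names_unique(pipelines):
    """Raise ValueError if two or more pipelines share a name."""
    counts = Counter(p.name for p in pipelines)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate pipeline names: {', '.join(duplicates)}")


# Distinct names pass silently; repeated names would raise ValueError.
check_names_unique([FakePipeline("A"), FakePipeline("B")])
```

Counting names first, rather than stopping at the first repeat, lets the error message list every duplicated name at once.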

evalml.automl.utils.get_best_sampler_for_data(X, y, sampler_method, sampler_balanced_ratio)

Returns the name of the sampler component to use for AutoMLSearch.

Parameters
  • X (pd.DataFrame) – The input feature data

  • y (pd.Series) – The input target data

  • sampler_method (str) – The sampler_method argument passed to AutoMLSearch

  • sampler_balanced_ratio (float) – The minority:majority class ratio at which the target is considered balanced, and to which the classes should be balanced otherwise.

Returns

The string name of the sampling component to use, or None if no sampler is necessary

Return type

str, None
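
To make the role of sampler_balanced_ratio concrete, here is a minimal sketch of the balance check only. It assumes the target counts as balanced when the minority:majority ratio is at or above the threshold. `minority_majority_ratio` and `pick_sampler` are hypothetical helpers, and the sketch simply passes sampler_method through rather than modeling how the library resolves a specific sampler component.

```python
from collections import Counter


def minority_majority_ratio(y):
    """Return (count of rarest class) / (count of most common class)."""
    counts = Counter(y)
    return min(counts.values()) / max(counts.values())


def pick_sampler(y, sampler_method, sampler_balanced_ratio):
    """Return None when the target is already balanced enough.

    Assumption: otherwise the requested sampler name is returned unchanged.
    """
    if minority_majority_ratio(y) >= sampler_balanced_ratio:
        return None
    return sampler_method


imbalanced = [0] * 90 + [1] * 10   # ratio 10/90, below 0.25
balanced = [0] * 60 + [1] * 40     # ratio 40/60, above 0.25
print(pick_sampler(imbalanced, "Undersampler", 0.25))
print(pick_sampler(balanced, "Undersampler", 0.25))
```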

evalml.automl.utils.get_default_primary_search_objective(problem_type)

Get the default primary search objective for a problem type.

Parameters

problem_type (str or ProblemType) – Problem type of interest.

Returns

Primary objective instance for the problem type.

Return type

ObjectiveBase

evalml.automl.utils.get_pipelines_from_component_graphs(component_graphs_dict, problem_type, parameters=None, random_seed=0)

Returns created pipelines from passed component graphs based on the specified problem type.

Parameters
  • component_graphs_dict (dict) – The dict of component graphs.

  • problem_type (str or ProblemType) – The problem type for which pipelines will be created.

  • parameters (dict) – Pipeline-level parameters that should be passed to the proposed pipelines. Defaults to None.

  • random_seed (int) – Random seed. Defaults to 0.

Returns

List of pipelines made from the passed component graphs.

Return type

list
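
The shape of this transformation (a dict of graphs in, a list of pipelines out) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes component_graphs_dict maps a pipeline name to a component graph, it uses a hypothetical `SketchPipeline` dataclass in place of real pipeline classes, and it leaves out the selection of a pipeline class from problem_type.

```python
from dataclasses import dataclass, field


@dataclass
class SketchPipeline:
    """Hypothetical stand-in for a pipeline built from one component graph."""
    component_graph: object
    name: str
    parameters: dict = field(default_factory=dict)
    random_seed: int = 0


def pipelines_from_graphs(component_graphs_dict, parameters=None, random_seed=0):
    """Build one pipeline per (name, graph) entry, sharing parameters and seed."""
    return [
        SketchPipeline(
            component_graph=graph,
            name=name,
            parameters=parameters or {},
            random_seed=random_seed,
        )
        for name, graph in component_graphs_dict.items()
    ]


# Placeholder graphs: plain lists of component names, for illustration only.
graphs = {
    "Pipeline A": ["Imputer", "Random Forest Classifier"],
    "Pipeline B": ["Imputer", "Logistic Regression Classifier"],
}
pipelines = pipelines_from_graphs(graphs, random_seed=42)
print([p.name for p in pipelines])
```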

evalml.automl.utils.make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=3, shuffle=True, random_seed=0)

Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features].

  • y (pd.Series) – The target training data of length [n_samples].

  • problem_type (ProblemType) – The type of machine learning problem.

  • problem_configuration (dict, None) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables. Defaults to None.

  • n_splits (int, None) – The number of CV splits, if applicable. Defaults to 3.

  • shuffle (bool) – Whether or not to shuffle the data before splitting, if applicable. Defaults to True.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

Data splitting method.

Return type

sklearn.model_selection.BaseCrossValidator

Raises

ValueError – If problem_configuration is not given for a time-series problem.
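
To show how n_splits, shuffle, and random_seed interact in a cross-validation splitter, here is a minimal standard-library sketch of plain k-fold index splitting. It does not reproduce how this function chooses a splitter for a given problem type or data size, and it does not cover the time series case. `kfold_indices` is a hypothetical helper.

```python
import random


def kfold_indices(n_samples, n_splits=3, shuffle=True, random_seed=0):
    """Yield (train_indices, test_indices) pairs for plain k-fold CV."""
    if n_splits < 2 or n_splits > n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible per seed.
        random.Random(random_seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in kfold_indices(10, n_splits=3, shuffle=True, random_seed=0):
    print(len(train), len(test))
```

Every sample lands in exactly one test fold, and the same random_seed always produces the same folds, which is what makes search results comparable across runs.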

evalml.automl.utils.tune_binary_threshold(pipeline, objective, problem_type, X_threshold_tuning, y_threshold_tuning, X=None, y=None)

Tunes the threshold of a binary pipeline to the X and y thresholding data.

Parameters
  • pipeline (Pipeline) – Pipeline instance to threshold.

  • objective (ObjectiveBase) – The objective to tune against. If it is not tuneable and best_pipeline is True, F1 is used instead.

  • problem_type (ProblemType) – The problem type of the pipeline.

  • X_threshold_tuning (pd.DataFrame) – Features on which the threshold will be tuned.

  • y_threshold_tuning (pd.Series) – Target data on which the threshold will be tuned.

  • X (pd.DataFrame) – Features on which the pipeline will be trained (used for time series binary). Defaults to None.

  • y (pd.Series) – Target on which the pipeline will be trained (used for time series binary). Defaults to None.
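
Threshold tuning in general can be sketched with the standard library alone. The example below is an assumption-laden illustration, not this function's implementation: it swaps in a simple grid scan over candidate thresholds, uses F1 as the objective, and works on raw predicted probabilities rather than a pipeline. `f1_score` and `tune_threshold` are hypothetical helpers.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels in {0, 1}; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, probabilities, candidates=None):
    """Grid-scan thresholds and return (best_threshold, best_score)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_threshold, best_score = 0.5, -1.0
    for threshold in candidates:
        y_pred = [1 if p >= threshold else 0 for p in probabilities]
        score = f1_score(y_true, y_pred)
        # Strict comparison keeps the lowest threshold among ties.
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


y_true = [0, 0, 1, 1]
probabilities = [0.10, 0.35, 0.40, 0.80]
threshold, score = tune_threshold(y_true, probabilities)
print(threshold, score)
```

Here the default threshold of 0.5 would misclassify the positive sample scored 0.40, while any threshold above 0.35 and at most 0.40 separates the two classes perfectly. Tuning on data held out from training, as the X_threshold_tuning and y_threshold_tuning parameters suggest, avoids fitting the threshold to samples the model has already seen.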