utils#
Utilities useful in AutoML.
Module Contents#
Functions#
check_all_pipeline_names_unique – Checks whether all the pipeline names are unique.
get_best_sampler_for_data – Returns the name of the sampler component to use for AutoMLSearch.
get_default_primary_search_objective – Get the default primary search objective for a problem type.
get_pipelines_from_component_graphs – Returns created pipelines from passed component graphs based on the specified problem type.
make_data_splitter – Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.
tune_binary_threshold – Tunes the threshold of a binary pipeline to the X and y thresholding data.
Attributes Summary#
Contents#
- evalml.automl.utils.AutoMLConfig#
- evalml.automl.utils.check_all_pipeline_names_unique(pipelines)[source]#
Checks whether all the pipeline names are unique.
- Parameters
pipelines (list[PipelineBase]) – List of pipelines to check if all names are unique.
- Raises
ValueError – If any pipeline names are duplicated.
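The check described above can be illustrated with a minimal, self-contained sketch. It does not import EvalML: `NamedPipeline` and `check_names_unique` are hypothetical stand-ins, and the only assumption is that each pipeline exposes a `name` attribute. The actual error message raised by EvalML may differ.

```python
from collections import Counter


class NamedPipeline:
    """Hypothetical stand-in for a pipeline; only the `name` attribute matters here."""

    def __init__(self, name):
        self.name = name


def check_names_unique(pipelines):
    """Raise ValueError if two or more pipelines share a name (illustrative only)."""
    counts = Counter(p.name for p in pipelines)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate pipeline names: {', '.join(duplicates)}")


# Unique names pass silently.
check_names_unique([NamedPipeline("Logistic Regression"), NamedPipeline("Random Forest")])
```

Passing two pipelines with the same name to this sketch raises `ValueError`, mirroring the documented behavior.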
- evalml.automl.utils.get_best_sampler_for_data(X, y, sampler_method, sampler_balanced_ratio)[source]#
Returns the name of the sampler component to use for AutoMLSearch.
- Parameters
X (pd.DataFrame) – The input feature data
y (pd.Series) – The input target data
sampler_method (str) – The sampler_method argument passed to AutoMLSearch
sampler_balanced_ratio (float) – The minority:majority class ratio at or above which the target is considered balanced, and to which the classes are balanced otherwise.
- Returns
The string name of the sampling component to use, or None if no sampler is necessary
- Return type
str, None
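To make the role of sampler_balanced_ratio concrete, here is a small sketch of a minority:majority ratio check on a plain list of labels. The helper names `minority_majority_ratio` and `needs_sampler` are hypothetical, and the comparison rule is an assumption for illustration. It is not EvalML's implementation, and it does not decide which sampler component is returned.

```python
from collections import Counter


def minority_majority_ratio(y):
    """Ratio of the smallest class count to the largest class count."""
    counts = Counter(y)
    return min(counts.values()) / max(counts.values())


def needs_sampler(y, sampler_balanced_ratio):
    """Assumed rule: the target is imbalanced if the ratio falls below the threshold."""
    return minority_majority_ratio(y) < sampler_balanced_ratio


imbalanced = [0] * 90 + [1] * 10   # ratio 10/90, about 0.11
balanced = [0] * 60 + [1] * 40     # ratio 40/60, about 0.67

print(needs_sampler(imbalanced, 0.25))  # True
print(needs_sampler(balanced, 0.25))    # False
```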
- evalml.automl.utils.get_default_primary_search_objective(problem_type)[source]#
Get the default primary search objective for a problem type.
- Parameters
problem_type (str or ProblemType) – Problem type of interest.
- Returns
Primary objective instance for the problem type.
- Return type
ObjectiveBase
- evalml.automl.utils.get_pipelines_from_component_graphs(component_graphs_dict, problem_type, parameters=None, random_seed=0)[source]#
Returns created pipelines from passed component graphs based on the specified problem type.
- Parameters
component_graphs_dict (dict) – The dict of component graphs.
problem_type (str or ProblemType) – The problem type for which pipelines will be created.
parameters (dict) – Pipeline-level parameters that should be passed to the proposed pipelines. Defaults to None.
random_seed (int) – Random seed. Defaults to 0.
- Returns
List of pipelines made from the passed component graphs.
- Return type
list
- evalml.automl.utils.make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=3, shuffle=True, random_seed=0)[source]#
Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.
- Parameters
X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].
problem_type (ProblemType) – The type of machine learning problem.
problem_configuration (dict, None) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables. Defaults to None.
n_splits (int, None) – The number of CV splits, if applicable. Defaults to 3.
shuffle (bool) – Whether or not to shuffle the data before splitting, if applicable. Defaults to True.
random_seed (int) – Seed for the random number generator. Defaults to 0.
- Returns
Data splitting method.
- Return type
sklearn.model_selection.BaseCrossValidator
- Raises
ValueError – If problem_configuration is not given for a time-series problem.
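The effect of n_splits, shuffle, and random_seed can be illustrated with a plain k-fold index generator. This is a sketch only: `k_fold_indices` is a hypothetical helper written with the standard library, not the splitter that make_data_splitter returns, and it ignores problem-type-specific behavior such as time series splitting.

```python
import random


def k_fold_indices(n_samples, n_splits=3, shuffle=True, random_seed=0):
    """Yield (train_indices, test_indices) pairs for a simple k-fold split."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible across calls.
        random.Random(random_seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10):
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, and calling the generator twice with the same random_seed gives the same folds.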
- evalml.automl.utils.tune_binary_threshold(pipeline, objective, problem_type, X_threshold_tuning, y_threshold_tuning, X=None, y=None)[source]#
Tunes the threshold of a binary pipeline to the X and y thresholding data.
- Parameters
pipeline (Pipeline) – Pipeline instance to threshold.
objective (ObjectiveBase) – The objective to tune the threshold with. If the objective is not tuneable and best_pipeline is True, F1 is used instead.
problem_type (ProblemType) – The problem type of the pipeline.
X_threshold_tuning (pd.DataFrame) – Features on which the pipeline threshold will be tuned.
y_threshold_tuning (pd.Series) – Target data on which the pipeline threshold will be tuned.
X (pd.DataFrame) – Features on which the pipeline is trained (used for time series binary). Defaults to None.
y (pd.Series) – Target on which the pipeline is trained (used for time series binary). Defaults to None.
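As a rough illustration of what tuning a binary threshold means, the sketch below scans candidate thresholds over predicted probabilities and keeps the one with the best F1 score on the tuning data. The helpers `f1_score` and `tune_threshold` are hypothetical and use a simple grid scan. EvalML's own optimization is driven by the objective and may work differently.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels in {0, 1}; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(y_true, y_proba, candidates=None):
    """Return (threshold, score) for the first candidate that maximizes F1."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_threshold, best_score = 0.5, -1.0
    for threshold in candidates:
        preds = [1 if p >= threshold else 0 for p in y_proba]
        score = f1_score(y_true, preds)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy tuning data: probabilities for the positive class and the true labels.
threshold, score = tune_threshold([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])
print(threshold, score)
```

Holding out separate tuning data, as the X_threshold_tuning and y_threshold_tuning parameters suggest, keeps the chosen threshold from being fit to the same rows the pipeline was trained on.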