utils#
Utilities useful in AutoML.
Module Contents#
Functions#
| Function | Description |
| --- | --- |
| check_all_pipeline_names_unique | Checks whether all the pipeline names are unique. |
| get_best_sampler_for_data | Returns the name of the sampler component to use for AutoMLSearch. |
| get_default_primary_search_objective | Get the default primary search objective for a problem type. |
| get_pipelines_from_component_graphs | Returns created pipelines from passed component graphs based on the specified problem type. |
| get_threshold_tuning_info | Determine for a given automl config and pipeline what the threshold tuning objective should be and whether or not training data should be further split to achieve proper threshold tuning. |
| make_data_splitter | Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search. |
| resplit_training_data | Further split the training data for a given pipeline. This is needed for binary pipelines in order to properly tune the threshold. |
| tune_binary_threshold | Tunes the threshold of a binary pipeline to the X and y thresholding data. |
Attributes Summary#
Contents#
- evalml.automl.utils.AutoMLConfig#
- evalml.automl.utils.check_all_pipeline_names_unique(pipelines)[source]#
Checks whether all the pipeline names are unique.
- Parameters
pipelines (list[PipelineBase]) – List of pipelines to check if all names are unique.
- Raises
ValueError – If any pipeline names are duplicated.
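The uniqueness check described above can be sketched in plain Python. This is an illustration only, not evalml's implementation: `NamedPipeline` and `check_names_unique` are hypothetical stand-ins, and the only assumption about a pipeline is that it exposes a `name` attribute.

```python
from collections import Counter


class NamedPipeline:
    """Hypothetical stand-in for a pipeline; only the `name` attribute matters here."""

    def __init__(self, name):
        self.name = name


def check_names_unique(pipelines):
    """Raise ValueError if two or more pipelines share a name."""
    counts = Counter(p.name for p in pipelines)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate pipeline names: {', '.join(duplicates)}")


# Distinct names pass silently.
check_names_unique([NamedPipeline("rf"), NamedPipeline("xgb")])

# A repeated name raises ValueError.
try:
    check_names_unique([NamedPipeline("rf"), NamedPipeline("rf")])
except ValueError as err:
    message = str(err)
```

Listing the duplicated names in the error message makes it easier to see which pipelines need renaming.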
- evalml.automl.utils.get_best_sampler_for_data(X, y, sampler_method, sampler_balanced_ratio)[source]#
Returns the name of the sampler component to use for AutoMLSearch.
- Parameters
X (pd.DataFrame) – The input feature data
y (pd.Series) – The input target data
sampler_method (str) – The sampler_type argument passed to AutoMLSearch
sampler_balanced_ratio (float) – The minority:majority class ratio at or above which the targets are considered balanced, and to which the classes are balanced otherwise.
- Returns
The string name of the sampling component to use, or None if no sampler is necessary
- Return type
str, None
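To make the role of sampler_balanced_ratio concrete, here is a minimal sketch of the balance test it implies. The helper `needs_sampler` is hypothetical and only decides whether sampling would be needed; which sampler name the library returns for a given sampler_method is not shown here.

```python
from collections import Counter


def needs_sampler(y, balanced_ratio):
    """Return True if the minority:majority class ratio is below balanced_ratio."""
    counts = Counter(y)
    majority = max(counts.values())
    minority = min(counts.values())
    return (minority / majority) < balanced_ratio


# Two classes with counts 2 and 3: ratio 2/3, compared against 0.25.
balanced = needs_sampler([0, 0, 1, 1, 1], balanced_ratio=0.25)

# Two classes with counts 1 and 9: ratio 1/9, compared against 0.25.
imbalanced = needs_sampler([0] * 9 + [1], balanced_ratio=0.25)
```

Under this sketch, a result of False corresponds to the documented None return (no sampler necessary).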
- evalml.automl.utils.get_default_primary_search_objective(problem_type)[source]#
Get the default primary search objective for a problem type.
- Parameters
problem_type (str or ProblemType) – Problem type of interest.
- Returns
Primary objective instance for the problem type.
- Return type
ObjectiveBase
- evalml.automl.utils.get_pipelines_from_component_graphs(component_graphs_dict, problem_type, parameters=None, random_seed=0)[source]#
Returns created pipelines from passed component graphs based on the specified problem type.
- Parameters
component_graphs_dict (dict) – The dict of component graphs.
problem_type (str or ProblemType) – The problem type for which pipelines will be created.
parameters (dict) – Pipeline-level parameters that should be passed to the proposed pipelines. Defaults to None.
random_seed (int) – Random seed. Defaults to 0.
- Returns
List of pipelines made from the passed component graphs.
- Return type
list
- evalml.automl.utils.get_threshold_tuning_info(automl_config, pipeline)[source]#
Determine for a given automl config and pipeline what the threshold tuning objective should be and whether or not training data should be further split to achieve proper threshold tuning.
Can also be used after automl search has been performed to determine whether the full training data was used to train the pipeline.
- Parameters
automl_config (AutoMLConfig) – The AutoMLSearch’s config object. Used to determine threshold tuning objective and whether data needs resplitting.
pipeline (Pipeline) – The pipeline instance to threshold.
- Returns
threshold_tuning_objective, data_needs_resplitting: the threshold tuning objective and whether the training data needs resplitting.
- Return type
str, bool
- evalml.automl.utils.make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=3, shuffle=True, random_seed=0)[source]#
Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.
- Parameters
X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].
problem_type (ProblemType) – The type of machine learning problem.
problem_configuration (dict, None) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables. Defaults to None.
n_splits (int, None) – The number of CV splits, if applicable. Defaults to 3.
shuffle (bool) – Whether or not to shuffle the data before splitting, if applicable. Defaults to True.
random_seed (int) – Seed for the random number generator. Defaults to 0.
- Returns
Data splitting method.
- Return type
sklearn.model_selection.BaseCrossValidator
- Raises
ValueError – If problem_configuration is not given for a time-series problem.
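The kind of dispatch this function performs can be sketched as below. This is an assumption-laden illustration, not the library's logic: `choose_splitter_name` is hypothetical, the problem-type strings are assumed, and the mapping to the scikit-learn splitter families `TimeSeriesSplit`, `KFold`, and `StratifiedKFold` is a plausible choice rather than a documented one. Only the ValueError for a missing time series configuration comes from the description above.

```python
def choose_splitter_name(problem_type, problem_configuration=None):
    """Pick a splitter family by problem type (illustrative mapping only)."""
    if problem_type.startswith("time series"):
        # Time series problems need time_index, gap, max_delay, and so on.
        if not problem_configuration:
            raise ValueError(
                "problem_configuration is required for time series problems"
            )
        return "TimeSeriesSplit"
    if problem_type == "regression":
        # Continuous targets cannot be stratified.
        return "KFold"
    # Classification: keep class proportions similar across folds.
    return "StratifiedKFold"


binary_choice = choose_splitter_name("binary")
regression_choice = choose_splitter_name("regression")
ts_choice = choose_splitter_name(
    "time series regression",
    {"time_index": "date", "gap": 0, "max_delay": 2},
)
```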
- evalml.automl.utils.resplit_training_data(pipeline, X_train, y_train)[source]#
Further split the training data for a given pipeline. This is needed for binary pipelines in order to properly tune the threshold.
Can be used after automl search has been performed to recreate the data that was used to train a pipeline.
- Parameters
pipeline (PipelineBase) – The pipeline whose training data is being split.
X_train (pd.DataFrame or np.ndarray) – Training data of shape [n_samples, n_features].
y_train (pd.Series or np.ndarray) – Training target data of length [n_samples].
- Returns
Feature and target data each split into train and threshold tuning sets.
- Return type
pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
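The idea of holding out part of the training data for threshold tuning can be sketched with plain lists. The helper `holdout_split`, the 20% tuning fraction, and the seed are all assumptions for illustration; the library's actual split size and method are not stated here. The return order mirrors the documented one: features (train, tuning), then targets (train, tuning).

```python
import random


def holdout_split(X, y, tuning_fraction=0.2, seed=0):
    """Split rows into a training part and a threshold-tuning part."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_tuning = max(1, int(len(indices) * tuning_fraction))
    tuning_idx, train_idx = indices[:n_tuning], indices[n_tuning:]
    X_train = [X[i] for i in train_idx]
    X_tuning = [X[i] for i in tuning_idx]
    y_train = [y[i] for i in train_idx]
    y_tuning = [y[i] for i in tuning_idx]
    return X_train, X_tuning, y_train, y_tuning


X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_tuning, y_train, y_tuning = holdout_split(X, y)
```

Because the split is seeded, calling the helper again with the same inputs recreates the same partition, which is the property that lets training data be reconstructed after a search.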
- evalml.automl.utils.tune_binary_threshold(pipeline, objective, problem_type, X_threshold_tuning, y_threshold_tuning, X=None, y=None)[source]#
Tunes the threshold of a binary pipeline to the X and y thresholding data.
- Parameters
pipeline (Pipeline) – Pipeline instance to threshold.
objective (ObjectiveBase) – The objective to tune the threshold with. If the objective is not tuneable and best_pipeline is True, F1 is used instead.
problem_type (ProblemType) – The problem type of the pipeline.
X_threshold_tuning (pd.DataFrame) – Features on which the pipeline's threshold will be tuned.
y_threshold_tuning (pd.Series) – Target data on which the pipeline's threshold will be tuned.
X (pd.DataFrame) – Features on which the pipeline will be trained (used for time series binary). Defaults to None.
y (pd.Series) – Target data on which the pipeline will be trained (used for time series binary). Defaults to None.
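Threshold tuning for a binary classifier can be illustrated with a simple grid scan that maximizes F1 on the tuning data. This is a sketch under stated assumptions, not the library's optimizer: `f1_score` and `tune_threshold` are hypothetical helpers written here, the candidate grid is arbitrary, and the probabilities are made-up inputs standing in for a pipeline's predicted probabilities.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(y_true, probabilities, candidates):
    """Return the first candidate threshold with the highest F1, and that F1."""
    best_threshold, best_score = None, -1.0
    for threshold in candidates:
        y_pred = [1 if p >= threshold else 0 for p in probabilities]
        score = f1_score(y_true, y_pred)
        if score > best_score:  # strict comparison keeps the earliest best
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


y_true = [0, 0, 1, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8, 0.9]
candidates = [i / 10 for i in range(1, 10)]
threshold, score = tune_threshold(y_true, probabilities, candidates)
```

Tuning on data that was held out from training keeps the chosen threshold from simply fitting the training set.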