utils
Utilities useful in AutoML.
Module Contents
Functions
check_all_pipeline_names_unique | Checks whether all the pipeline names are unique.
get_best_sampler_for_data | Returns the name of the sampler component to use for AutoMLSearch.
get_default_primary_search_objective | Get the default primary search objective for a problem type.
get_pipelines_from_component_graphs | Returns created pipelines from passed component graphs based on the specified problem type.
make_data_splitter | Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.
tune_binary_threshold | Tunes the threshold of a binary pipeline to the X and y thresholding data.
Attributes Summary
Contents
- evalml.automl.utils.AutoMLConfig
- evalml.automl.utils.check_all_pipeline_names_unique(pipelines)
Checks whether all the pipeline names are unique.
- Parameters
pipelines (list[PipelineBase]) – List of pipelines to check if all names are unique.
- Raises
ValueError – If any pipeline names are duplicated.
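The uniqueness check described above can be sketched in plain Python. This is a minimal illustration, not EvalML's implementation: `FakePipeline`, `find_duplicate_names`, and `check_names_unique` are hypothetical names, and the only assumption about a pipeline is that it exposes a `name` attribute.

```python
from collections import Counter


class FakePipeline:
    """Hypothetical stand-in for a pipeline; only the `name` attribute matters here."""

    def __init__(self, name):
        self.name = name


def find_duplicate_names(pipelines):
    """Return the sorted list of names that appear more than once."""
    counts = Counter(p.name for p in pipelines)
    return sorted(name for name, n in counts.items() if n > 1)


def check_names_unique(pipelines):
    """Raise ValueError if any pipeline name is duplicated (illustrative only)."""
    duplicates = find_duplicate_names(pipelines)
    if duplicates:
        raise ValueError(f"Pipeline names must be unique; duplicated: {duplicates}")


pipelines = [FakePipeline("rf"), FakePipeline("xgb"), FakePipeline("rf")]
try:
    check_names_unique(pipelines)
    error_message = None
except ValueError as err:
    error_message = str(err)
```

Reporting the duplicated names in the error message, rather than only signalling failure, makes it easier to see which pipelines need renaming.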
- evalml.automl.utils.get_best_sampler_for_data(X, y, sampler_method, sampler_balanced_ratio)
Returns the name of the sampler component to use for AutoMLSearch.
- Parameters
X (pd.DataFrame) – The input feature data
y (pd.Series) – The input target data
sampler_method (str) – The sampler_method argument passed to AutoMLSearch.
sampler_balanced_ratio (float) – The minority:majority class ratio at or above which the targets are considered balanced, and to which imbalanced classes should be resampled.
- Returns
The string name of the sampling component to use, or None if no sampler is necessary.
- Return type
str, None
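To make the meaning of `sampler_balanced_ratio` concrete, the sketch below computes a minority:majority ratio and compares it with a chosen value. It is only an illustration of the ratio itself: the helper names are hypothetical, the strict `<` comparison is an assumption, and the logic that picks a specific sampler component from `sampler_method` is not reproduced.

```python
from collections import Counter


def minority_majority_ratio(y):
    """Return (count of the rarest class) / (count of the most common class)."""
    counts = Counter(y)
    return min(counts.values()) / max(counts.values())


def needs_sampler(y, sampler_balanced_ratio):
    """True when the classes are more imbalanced than the ratio allows.

    Assumption: a strict comparison; the library may treat the boundary differently.
    """
    return minority_majority_ratio(y) < sampler_balanced_ratio


y_imbalanced = [0] * 90 + [1] * 10  # 10 minority vs 90 majority samples
y_balanced = [0] * 50 + [1] * 50    # equal class counts
```

With a ratio of 0.25, `y_imbalanced` would call for a sampler and `y_balanced` would not.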
- evalml.automl.utils.get_default_primary_search_objective(problem_type)
Get the default primary search objective for a problem type.
- Parameters
problem_type (str or ProblemType) – Problem type of interest.
- Returns
Primary objective instance for the problem type.
- Return type
ObjectiveBase
- evalml.automl.utils.get_pipelines_from_component_graphs(component_graphs_dict, problem_type, parameters=None, random_seed=0)
Returns created pipelines from passed component graphs based on the specified problem type.
- Parameters
component_graphs_dict (dict) – The dict of component graphs.
problem_type (str or ProblemType) – The problem type for which pipelines will be created.
parameters (dict) – Pipeline-level parameters that should be passed to the proposed pipelines. Defaults to None.
random_seed (int) – Random seed. Defaults to 0.
- Returns
List of pipelines made from the passed component graphs.
- Return type
list
- evalml.automl.utils.make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=3, shuffle=True, random_seed=0)
Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.
- Parameters
X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].
problem_type (ProblemType) – The type of machine learning problem.
problem_configuration (dict, None) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the date_index, gap, and max_delay variables. Defaults to None.
n_splits (int, None) – The number of CV splits, if applicable. Defaults to 3.
shuffle (bool) – Whether or not to shuffle the data before splitting, if applicable. Defaults to True.
random_seed (int) – Seed for the random number generator. Defaults to 0.
- Returns
Data splitting method.
- Return type
sklearn.model_selection.BaseCrossValidator
- Raises
ValueError – If problem_configuration is not given for a time-series problem.
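As a rough picture of what `n_splits`, `shuffle`, and `random_seed` control, here is a plain-Python k-fold index generator. It is a hedged sketch only: `k_fold_indices` is a hypothetical helper, and the splitter actually returned depends on the problem type and is not reproduced here.

```python
import random


def k_fold_indices(n_samples, n_splits=3, shuffle=True, random_seed=0):
    """Yield (train_indices, test_indices) pairs for a simple k-fold split."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible across calls.
        random.Random(random_seed).shuffle(indices)
    # Distribute samples as evenly as possible across folds.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_indices(10, n_splits=3, shuffle=True, random_seed=0))
```

Each sample lands in exactly one test fold, and reusing the same `random_seed` reproduces the same folds.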
- evalml.automl.utils.tune_binary_threshold(pipeline, objective, problem_type, X_threshold_tuning, y_threshold_tuning)
Tunes the threshold of a binary pipeline to the X and y thresholding data.
- Parameters
pipeline (Pipeline) – Pipeline instance to threshold.
objective (ObjectiveBase) – The objective to tune the threshold with. If the objective is not tuneable and best_pipeline is True, F1 is used instead.
problem_type (ProblemType) – The problem type of the pipeline.
X_threshold_tuning (pd.DataFrame) – Features to tune pipeline to.
y_threshold_tuning (pd.Series) – Target data to tune pipeline to.
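The idea of tuning a binary decision threshold against an objective can be sketched without any library. The example below scans an evenly spaced grid of thresholds and keeps the one with the highest F1. The grid scan is a stand-in chosen for clarity, not necessarily the optimizer the library uses, and `f1_score` and `tune_threshold` are hypothetical helpers operating on predicted probabilities rather than on a pipeline.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels in {0, 1}; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(y_true, probabilities, n_candidates=101):
    """Return (best_threshold, best_score) over an evenly spaced grid in [0, 1]."""
    best_threshold, best_score = 0.5, -1.0
    for i in range(n_candidates):
        threshold = i / (n_candidates - 1)
        # Predict the positive class when the probability reaches the threshold.
        y_pred = [1 if p >= threshold else 0 for p in probabilities]
        score = f1_score(y_true, y_pred)
        if score > best_score:  # strict comparison keeps the lowest best threshold
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


y_true = [0, 0, 0, 1, 1, 1]
probabilities = [0.1, 0.2, 0.35, 0.3, 0.8, 0.9]
best_threshold, best_score = tune_threshold(y_true, probabilities)
```

On this toy data the default threshold of 0.5 misses the positive sample scored at 0.3, while a lower threshold recovers it at the cost of one false positive, which gives a higher F1.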