iterative_algorithm

An AutoML algorithm which first fits a base round of pipelines with default parameters, then does a round of parameter tuning on each pipeline in order of performance.

Module Contents

Classes Summary

IterativeAlgorithm – An AutoML algorithm which first fits a base round of pipelines with default parameters, then does a round of parameter tuning on each pipeline in order of performance.

Contents

class evalml.automl.automl_algorithm.iterative_algorithm.IterativeAlgorithm(X, y, problem_type, sampler_name=None, allowed_model_families=None, excluded_model_families=None, allowed_component_graphs=None, max_batches=None, max_iterations=None, tuner_class=None, random_seed=0, pipelines_per_batch=5, n_jobs=-1, number_features=None, ensembling=False, text_in_ensembling=False, search_parameters=None, _estimator_family_order=None, allow_long_running_models=False, features=None, verbose=False, exclude_featurizers=None)

An AutoML algorithm which first fits a base round of pipelines with default parameters, then does a round of parameter tuning on each pipeline in order of performance.
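The two-phase strategy described above can be illustrated with a toy sketch in plain Python. This is not evalml's implementation: `toy_iterative_search`, `evaluate`, and `tune` are hypothetical names, pipelines are reduced to named parameter dictionaries, and random sampling stands in for a real tuner.

```python
import random


def toy_iterative_search(candidates, evaluate, tune, pipelines_per_batch=5, seed=0):
    """Toy two-phase search: a default-parameter base round, then tuning.

    candidates: dict mapping a pipeline name to its default parameters.
    evaluate:   callable (name, params) -> score, where lower is better.
    tune:       callable (name, rng) -> a new parameter dict to try.
    """
    rng = random.Random(seed)

    # Base round: evaluate every candidate once with its default parameters.
    results = {
        name: [(evaluate(name, params), params)]
        for name, params in candidates.items()
    }

    # Rank candidates by their base-round score, best (lowest) first.
    order = sorted(results, key=lambda name: results[name][0][0])

    # Tuning rounds: one batch per candidate, visited in order of performance.
    for name in order:
        for _ in range(pipelines_per_batch):
            params = tune(name, rng)
            results[name].append((evaluate(name, params), params))

    # Keep the best trial seen for each candidate.
    best = {name: min(trials, key=lambda t: t[0]) for name, trials in results.items()}
    return order, best


# Hypothetical objective: distance of a single parameter "c" from a target value.
defaults = {"linear": {"c": 1.0}, "forest": {"c": 4.0}}
targets = {"linear": 3.0, "forest": 5.0}


def evaluate(name, params):
    return abs(params["c"] - targets[name])


def tune(name, rng):
    return {"c": rng.uniform(0.0, 10.0)}


order, best = toy_iterative_search(defaults, evaluate, tune)
```

In this toy run, "forest" has the better base-round score, so it is tuned first. Tuning can only keep or improve each candidate's best score, because the default-parameter trial stays in the pool.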

Parameters
  • X (pd.DataFrame) – Training data.

  • y (pd.Series) – Target data.

  • problem_type (ProblemType) – Problem type associated with training data.

  • sampler_name (BaseSampler) – Sampler to use for preprocessing. Defaults to None.

  • allowed_model_families (list(str, ModelFamily)) – The model families to search. The default of None searches over all model families. Run evalml.pipelines.components.utils.allowed_model_families("binary") to see options. Change "binary" to "multiclass" or "regression" depending on the problem type. Note that if allowed_component_graphs is provided, this parameter will be ignored.

  • excluded_model_families (list(str, ModelFamily)) – A list of model families to exclude from the estimators used when building pipelines.

  • allowed_component_graphs (dict) –

    A dictionary of lists or ComponentGraphs indicating the component graphs allowed in the search. The format should follow { "Name_0": [list_of_components], "Name_1": [ComponentGraph(...)] }

    The default of None indicates all pipeline component graphs for this problem type are allowed. Setting this field will cause allowed_model_families to be ignored.

    e.g. allowed_component_graphs = { "My_Graph": ["Imputer", "One Hot Encoder", "Random Forest Classifier"] }

  • max_batches (int) – The maximum number of batches to be evaluated. Used to determine ensembling. Defaults to None.

  • max_iterations (int) – The maximum number of iterations to be evaluated. Used to determine ensembling. Defaults to None.

  • tuner_class (class) – A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

  • pipelines_per_batch (int) – The number of pipelines to be evaluated in each batch, after the first batch. Defaults to 5.

  • n_jobs (int or None) – Integer describing the level of parallelism used for pipelines. Defaults to -1.

  • number_features (int) – The number of columns in the input features. Defaults to None.

  • ensembling (boolean) – If True, runs ensembling in a separate batch after every allowed pipeline class has been iterated over. Defaults to False.

  • text_in_ensembling (boolean) – If True and ensembling is True, then n_jobs will be set to 1 to avoid downstream sklearn stacking issues related to nltk. Defaults to False.

  • search_parameters (dict or None) – Pipeline-level parameters and custom hyperparameter ranges specified for pipelines to iterate over. Hyperparameter ranges must be passed in as skopt.space objects. Defaults to None.

  • _estimator_family_order (list(ModelFamily) or None) – Specifies the sort order for the first batch. Defaults to None, which uses _ESTIMATOR_FAMILY_ORDER.

  • allow_long_running_models (bool) – Whether or not to allow longer-running models for large multiclass problems. If False and no pipelines, component graphs, or model families are provided, AutoMLSearch will not use Elastic Net or XGBoost when there are more than 75 multiclass targets and will not use CatBoost when there are more than 150 multiclass targets. Defaults to False.

  • features (list) – List of features to run DFS on in AutoML pipelines. Defaults to None. A feature is only computed if the columns it uses exist in the input and the feature itself is not already in the input.

  • verbose (boolean) – Whether or not to display logging information regarding pipeline building. Defaults to False.

  • exclude_featurizers (list[str]) – A list of featurizer components to exclude from the pipelines built by IterativeAlgorithm. Valid options are "DatetimeFeaturizer", "EmailFeaturizer", "URLFeaturizer", "NaturalLanguageFeaturizer", and "TimeSeriesFeaturizer".
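As a rough illustration of how the optional arguments above fit together, the sketch below assembles them as a plain dictionary. The component names and featurizer options are copied from the parameter descriptions. The specific values (such as `max_batches=3`) are arbitrary, and the final constructor call is left as a comment because it assumes evalml is installed and that `X`, `y`, and `problem_type` have already been prepared.

```python
# Component graph taken from the example in the parameter list.
allowed_component_graphs = {
    "My_Graph": ["Imputer", "One Hot Encoder", "Random Forest Classifier"],
}

# Featurizer names listed as valid options for exclude_featurizers.
VALID_FEATURIZERS = {
    "DatetimeFeaturizer",
    "EmailFeaturizer",
    "URLFeaturizer",
    "NaturalLanguageFeaturizer",
    "TimeSeriesFeaturizer",
}

search_kwargs = {
    "allowed_component_graphs": allowed_component_graphs,
    "max_batches": 3,  # arbitrary illustrative value
    "pipelines_per_batch": 5,  # documented default
    "random_seed": 0,  # documented default
    "exclude_featurizers": ["URLFeaturizer", "EmailFeaturizer"],
}

# allowed_model_families is deliberately omitted: it is ignored whenever
# allowed_component_graphs is set.

# Catch typos in featurizer names before handing them to the algorithm.
unknown = set(search_kwargs["exclude_featurizers"]) - VALID_FEATURIZERS

# With evalml installed, the call would look roughly like this:
# algorithm = IterativeAlgorithm(X, y, problem_type, **search_kwargs)
```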

Methods

add_result – Register results from evaluating a pipeline.

batch_number – Returns the number of batches which have been recommended so far.

default_max_batches – Returns the number of max batches AutoMLSearch should run by default.

next_batch – Get the next batch of pipelines to evaluate.

num_pipelines_per_batch – Return the number of pipelines in the nth batch.

pipeline_number – Returns the number of pipelines which have been recommended so far.

add_result(self, score_to_minimize, pipeline, trained_pipeline_results, cached_data=None)

Register results from evaluating a pipeline.

Parameters
  • score_to_minimize (float) – The score obtained by this pipeline on the primary objective, converted so that lower values indicate better pipelines.

  • pipeline (PipelineBase) – The trained pipeline object which was used to compute the score.

  • trained_pipeline_results (dict) – Results from training a pipeline.

  • cached_data (dict) – A dictionary of cached data, keyed by model family. Expected to be of format {model_family: {hash1: trained_component_graph, hash2: trained_component_graph…}…}. Defaults to None.

Raises

ValueError – If default parameters are not in the acceptable hyperparameter ranges.
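The score_to_minimize argument expects scores in a "lower is better" form. One common way to get there is to negate scores from objectives where higher is better. The helper below is a hypothetical sketch of that conversion, not an evalml function. The `greater_is_better` flag is an assumption about how the caller knows the objective's direction.

```python
def to_score_to_minimize(score, greater_is_better):
    """Convert a raw objective score so that lower values mean better pipelines."""
    return -score if greater_is_better else score


# An AUC-like score (higher is better) is negated.
auc_like = to_score_to_minimize(0.91, greater_is_better=True)

# A log-loss-like score (lower is better) passes through unchanged.
log_loss_like = to_score_to_minimize(0.35, greater_is_better=False)

# After conversion, the better of two AUC-like scores has the smaller value.
better_is_smaller = to_score_to_minimize(0.95, True) < to_score_to_minimize(0.91, True)
```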

property batch_number(self)

Returns the number of batches which have been recommended so far.

property default_max_batches(self)

Returns the number of max batches AutoMLSearch should run by default.

next_batch(self)

Get the next batch of pipelines to evaluate.

Returns

A list of instances of PipelineBase subclasses, ready to be trained and evaluated.

Return type

list[PipelineBase]

Raises

AutoMLAlgorithmException – If no results were reported from the first batch.
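The methods documented here suggest a simple driver loop: request a batch, evaluate each pipeline, and report every result before asking for the next batch. The sketch below runs that loop against `StubAlgorithm`, a hypothetical stand-in that mimics only the documented interface (next_batch, add_result, batch_number), so it runs without evalml. `run_search` and `fake_evaluate` are likewise illustrative names.

```python
class StubAlgorithm:
    """Hypothetical stand-in exposing next_batch, add_result and batch_number."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._batch_number = 0
        self.results = []

    @property
    def batch_number(self):
        # Number of batches handed out so far.
        return self._batch_number

    def next_batch(self):
        batch = self._batches[self._batch_number]
        self._batch_number += 1
        return batch

    def add_result(self, score_to_minimize, pipeline, trained_pipeline_results, cached_data=None):
        self.results.append((score_to_minimize, pipeline))


def run_search(algorithm, evaluate, max_batches):
    """Drive any object with the documented interface for max_batches batches."""
    while algorithm.batch_number < max_batches:
        for pipeline in algorithm.next_batch():
            score, info = evaluate(pipeline)
            # Report each result so later batches can build on it.
            algorithm.add_result(score, pipeline, info)
    return algorithm


def fake_evaluate(pipeline):
    # Placeholder "training": the score is just the length of the name.
    return float(len(pipeline)), {"name": pipeline}


algo = run_search(StubAlgorithm([["a", "bbb"], ["cc"]]), fake_evaluate, max_batches=2)
best_score, best_pipeline = min(algo.results)
```

Reporting results inside the loop matters with the real class, since next_batch raises AutoMLAlgorithmException when no results were reported from the first batch.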

num_pipelines_per_batch(self, batch_number)

Return the number of pipelines in the nth batch.

Parameters

batch_number (int) – Which batch to calculate the number of pipelines for.

Returns

Number of pipelines in the given batch.

Return type

int
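Since pipelines_per_batch is described as applying to every batch after the first, the batch size might be reasoned about as in the sketch below. This is an assumption-laden illustration rather than the library's code: `sketch_num_pipelines_per_batch` and `first_batch_size` are hypothetical names, batches are assumed to be numbered from 0, and any special handling of ensembling batches is ignored.

```python
def sketch_num_pipelines_per_batch(batch_number, first_batch_size, pipelines_per_batch=5):
    """Illustrative batch-size rule: the base round, then fixed-size tuning batches."""
    if batch_number < 0:
        raise ValueError("batch_number must be non-negative")
    # Batch 0 is the base round with one entry per candidate pipeline;
    # later batches each hold pipelines_per_batch tuned variants.
    return first_batch_size if batch_number == 0 else pipelines_per_batch


# With 8 candidate pipelines in the base round and the default batch size:
sizes = [sketch_num_pipelines_per_batch(n, first_batch_size=8) for n in range(3)]
```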

property pipeline_number(self)

Returns the number of pipelines which have been recommended so far.