automl_algorithm
========================================

.. py:module:: evalml.automl.automl_algorithm

.. autoapi-nested-parse::

   AutoML algorithms that power EvalML.



Submodules
----------

.. toctree::
   :titlesonly:
   :maxdepth: 1

   automl_algorithm/index.rst
   default_algorithm/index.rst
   iterative_algorithm/index.rst


Package Contents
----------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.automl.automl_algorithm.AutoMLAlgorithm
   evalml.automl.automl_algorithm.DefaultAlgorithm
   evalml.automl.automl_algorithm.IterativeAlgorithm

Exceptions Summary
~~~~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.automl.automl_algorithm.AutoMLAlgorithmException

Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: AutoMLAlgorithm(allowed_pipelines=None, allowed_model_families=None, excluded_model_families=None, allowed_component_graphs=None, search_parameters=None, tuner_class=None, text_in_ensembling=False, random_seed=0, n_jobs=-1)

   Base class for the AutoML algorithms which power EvalML.

   This class represents an automated machine learning (AutoML) algorithm. It encapsulates the decision-making logic behind an AutoML search by deciding both which pipelines to evaluate next and which parameters to configure each pipeline with.

   To use this interface, you must define a next_batch method which returns the next group of pipelines to evaluate on the training data. That method may access state and results recorded from the previous batches, although that information is not tracked in a general way in this base class. Overriding add_result is a convenient way to record pipeline evaluation info if necessary. A minimal subclass sketch appears below, after the exception reference.

   :param allowed_pipelines: A list of PipelineBase subclasses indicating the pipelines allowed in the search. The default of None indicates all pipelines for this problem type are allowed.
   :type allowed_pipelines: list(class)
   :param search_parameters: Search parameter ranges specified for pipelines to iterate over.
   :type search_parameters: dict
   :param tuner_class: A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
   :type tuner_class: class
   :param text_in_ensembling: If True and ensembling is True, then n_jobs will be set to 1 to avoid downstream sklearn stacking issues related to nltk. Defaults to False.
   :type text_in_ensembling: boolean
   :param random_seed: Seed for the random number generator. Defaults to 0.
   :type random_seed: int

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.automl.automl_algorithm.AutoMLAlgorithm.add_result
      evalml.automl.automl_algorithm.AutoMLAlgorithm.batch_number
      evalml.automl.automl_algorithm.AutoMLAlgorithm.default_max_batches
      evalml.automl.automl_algorithm.AutoMLAlgorithm.next_batch
      evalml.automl.automl_algorithm.AutoMLAlgorithm.num_pipelines_per_batch
      evalml.automl.automl_algorithm.AutoMLAlgorithm.pipeline_number

   .. py:method:: add_result(self, score_to_minimize, pipeline, trained_pipeline_results)

      Register results from evaluating a pipeline.

      :param score_to_minimize: The score obtained by this pipeline on the primary objective, converted so that lower values indicate better pipelines.
      :type score_to_minimize: float
      :param pipeline: The trained pipeline object which was used to compute the score.
      :type pipeline: PipelineBase
      :param trained_pipeline_results: Results from training a pipeline.
      :type trained_pipeline_results: dict

      :raises PipelineNotFoundError: If pipeline is not allowed in search.

   .. py:method:: batch_number(self)
      :property:

      Returns the number of batches which have been recommended so far.

   .. py:method:: default_max_batches(self)
      :property:

      Returns the number of max batches AutoMLSearch should run by default.

   .. py:method:: next_batch(self)
      :abstractmethod:

      Get the next batch of pipelines to evaluate.

      :returns: A list of instances of PipelineBase subclasses, ready to be trained and evaluated.
      :rtype: list[PipelineBase]

   .. py:method:: num_pipelines_per_batch(self, batch_number)
      :abstractmethod:

      Return the number of pipelines in the nth batch.

      :param batch_number: which batch to calculate the number of pipelines for.
      :type batch_number: int

      :returns: number of pipelines in the given batch.
      :rtype: int

   .. py:method:: pipeline_number(self)
      :property:

      Returns the number of pipelines which have been recommended so far.


.. py:exception:: AutoMLAlgorithmException

   Exception raised when an error is encountered during the computation of the automl algorithm.
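**Example**

A minimal sketch of the subclassing contract described above: a hypothetical algorithm that re-proposes every allowed pipeline with default parameters in each batch. The class name and its ``_best_scores`` bookkeeping are illustrative only, and the sketch assumes ``allowed_pipelines`` is stored on the instance and holds pipeline instances (so ``clone`` and ``name`` are available). A real subclass would also advance the counters behind ``batch_number``/``pipeline_number`` and use per-pipeline tuners when proposing parameters.

.. code-block:: python

    from evalml.automl.automl_algorithm import AutoMLAlgorithm


    class RoundRobinAlgorithm(AutoMLAlgorithm):
        """Hypothetical algorithm: each batch contains one copy of every allowed pipeline."""

        def __init__(self, allowed_pipelines=None, random_seed=0, **kwargs):
            super().__init__(allowed_pipelines=allowed_pipelines, random_seed=random_seed, **kwargs)
            self._best_scores = {}  # pipeline name -> best score seen so far (illustrative bookkeeping)

        def next_batch(self):
            # Propose one default-parameter copy of each allowed pipeline. A real
            # implementation would ask each pipeline's tuner for new parameters and
            # update the counters that back ``batch_number`` and ``pipeline_number``.
            return [pipeline.clone() for pipeline in self.allowed_pipelines]

        def num_pipelines_per_batch(self, batch_number):
            # Every batch has the same size: one pipeline per allowed pipeline.
            return len(self.allowed_pipelines)

        def add_result(self, score_to_minimize, pipeline, trained_pipeline_results):
            # Keep the base-class bookkeeping (which raises PipelineNotFoundError for
            # pipelines outside the search), then track the best score per pipeline name.
            super().add_result(score_to_minimize, pipeline, trained_pipeline_results)
            previous = self._best_scores.get(pipeline.name)
            if previous is None or score_to_minimize < previous:
                self._best_scores[pipeline.name] = score_to_minimize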
.. py:class:: DefaultAlgorithm(X, y, problem_type, sampler_name, allowed_model_families=None, excluded_model_families=None, tuner_class=None, random_seed=0, search_parameters=None, n_jobs=-1, text_in_ensembling=False, top_n=3, ensembling=False, num_long_explore_pipelines=50, num_long_pipelines_per_batch=10, allow_long_running_models=False, features=None, verbose=False, exclude_featurizers=None)

   An automl algorithm that consists of two modes: fast and long, where fast is a subset of long.

   1. Naive pipelines:

      a. run baseline with default preprocessing pipeline
      b. run naive linear model with default preprocessing pipeline
      c. run basic RF pipeline with default preprocessing pipeline

   2. Naive pipelines with feature selection:

      a. subsequent pipelines will use the selected features with a SelectedColumns transformer

   At this point we have a single pipeline candidate for preprocessing and feature selection.

   3. Pipelines with preprocessing components:

      a. scan the rest of the estimators (our current batch 1)

   4. First ensembling run

   Fast mode ends here. Begin long mode.

   5. Run top 3 estimators:

      a. generate 50 random parameter sets per estimator and run all 150 in one batch

   6. Second ensembling run

   7. Repeat the following indefinitely until the stopping criterion is met:

      a. for each of the previous top 3 estimators, sample 10 parameters from the tuner and run all 30 in one batch
      b. run ensembling

   :param X: Training data.
   :type X: pd.DataFrame
   :param y: Target data.
   :type y: pd.Series
   :param problem_type: Problem type associated with training data.
   :type problem_type: ProblemType
   :param sampler_name: Sampler to use for preprocessing.
   :type sampler_name: BaseSampler
   :param tuner_class: A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
   :type tuner_class: class
   :param random_seed: Seed for the random number generator. Defaults to 0.
   :type random_seed: int
   :param search_parameters: Pipeline-level parameters and custom hyperparameter ranges specified for pipelines to iterate over. Hyperparameter ranges must be passed in as skopt.space objects. Defaults to None.
   :type search_parameters: dict or None
   :param n_jobs: Non-negative integer describing level of parallelism used for pipelines, or -1 to use all available processors. Defaults to -1.
   :type n_jobs: int or None
   :param text_in_ensembling: If True and ensembling is True, then n_jobs will be set to 1 to avoid downstream sklearn stacking issues related to nltk. Defaults to False.
   :type text_in_ensembling: boolean
   :param top_n: Number of top-performing pipelines to use in long mode. Defaults to 3.
   :type top_n: int
   :param num_long_explore_pipelines: Number of pipelines to explore for each top n pipeline at the start of long mode. Defaults to 50.
   :type num_long_explore_pipelines: int
   :param num_long_pipelines_per_batch: Number of pipelines per batch for each top n pipeline through long mode. Defaults to 10.
   :type num_long_pipelines_per_batch: int
   :param allow_long_running_models: Whether or not to allow longer-running models for large multiclass problems. If False and no pipelines, component graphs, or model families are provided, AutoMLSearch will not use Elastic Net or XGBoost when there are more than 75 multiclass targets and will not use CatBoost when there are more than 150 multiclass targets. Defaults to False.
   :type allow_long_running_models: bool
   :param features: List of features to run DFS on in AutoML pipelines. Defaults to None. Features will only be computed if the columns used by the feature exist in the input and if the feature has not been computed yet.
   :type features: list
   :param verbose: Whether or not to display logging information regarding pipeline building. Defaults to False.
   :type verbose: boolean
   :param exclude_featurizers: A list of featurizer components to exclude from the pipelines built by DefaultAlgorithm. Valid options are "DatetimeFeaturizer", "EmailFeaturizer", "URLFeaturizer", "NaturalLanguageFeaturizer", "TimeSeriesFeaturizer".
   :type exclude_featurizers: list[str]
   :param allowed_model_families: The model families to search. The default of None searches over all model families. Run evalml.pipelines.components.utils.allowed_model_families("binary") to see options. Change `binary` to `multiclass` or `regression` depending on the problem type.
   :type allowed_model_families: list(str, ModelFamily)
   :param excluded_model_families: A list of model families to exclude from the estimators used when building pipelines. For default algorithm, this only excludes estimators in the non-naive batches.
   :type excluded_model_families: list[ModelFamily]

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.automl.automl_algorithm.DefaultAlgorithm.add_result
      evalml.automl.automl_algorithm.DefaultAlgorithm.batch_number
      evalml.automl.automl_algorithm.DefaultAlgorithm.default_max_batches
      evalml.automl.automl_algorithm.DefaultAlgorithm.next_batch
      evalml.automl.automl_algorithm.DefaultAlgorithm.num_pipelines_per_batch
      evalml.automl.automl_algorithm.DefaultAlgorithm.pipeline_number

   .. py:method:: add_result(self, score_to_minimize, pipeline, trained_pipeline_results, cached_data=None)

      Register results from evaluating a pipeline. In batch number 2, the column names selected by the feature selector are recorded for use in a column selector. Information regarding the best pipeline is updated here as well.

      :param score_to_minimize: The score obtained by this pipeline on the primary objective, converted so that lower values indicate better pipelines.
      :type score_to_minimize: float
      :param pipeline: The trained pipeline object which was used to compute the score.
      :type pipeline: PipelineBase
      :param trained_pipeline_results: Results from training a pipeline.
      :type trained_pipeline_results: dict
      :param cached_data: A dictionary of cached data, where the keys are the model family. Expected to be of format {model_family: {hash1: trained_component_graph, hash2: trained_component_graph...}...}. Defaults to None.
      :type cached_data: dict

   .. py:method:: batch_number(self)
      :property:

      Returns the number of batches which have been recommended so far.

   .. py:method:: default_max_batches(self)
      :property:

      Returns the number of max batches AutoMLSearch should run by default.

   .. py:method:: next_batch(self)

      Get the next batch of pipelines to evaluate.

      :returns: A list of instances of PipelineBase subclasses, ready to be trained and evaluated.
      :rtype: list[PipelineBase]

   .. py:method:: num_pipelines_per_batch(self, batch_number)

      Return the number of pipelines in the nth batch.

      :param batch_number: which batch to calculate the number of pipelines for.
      :type batch_number: int

      :returns: number of pipelines in the given batch.
      :rtype: int

   .. py:method:: pipeline_number(self)
      :property:

      Returns the number of pipelines which have been recommended so far.
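**Example**

A sketch of how a ``DefaultAlgorithm`` instance can be driven by hand on a toy binary classification frame. In normal use AutoMLSearch owns this loop; the toy data, the ``sampler_name=None`` choice, the objective used for scoring, and the placeholder ``trained_pipeline_results`` dictionary are illustrative assumptions, not values the API requires.

.. code-block:: python

    import pandas as pd

    from evalml.automl.automl_algorithm import DefaultAlgorithm

    X = pd.DataFrame({"a": range(100), "b": range(100, 200)})
    y = pd.Series([0, 1] * 50)

    algorithm = DefaultAlgorithm(
        X=X,
        y=y,
        problem_type="binary",
        sampler_name=None,  # assumption: skip sampling for this sketch
    )

    # Drive the two naive batches by hand: evaluate each proposed pipeline, then
    # report its score back so later batches (feature selection, the estimator
    # scan, ensembling) can react to what has already been tried.
    for _ in range(2):
        for pipeline in algorithm.next_batch():
            pipeline.fit(X, y)
            # Log Loss is already lower-is-better, so it can be passed straight
            # through as the score to minimize.
            scores = pipeline.score(X, y, objectives=["Log Loss Binary"])
            algorithm.add_result(
                scores["Log Loss Binary"],
                pipeline,
                {"id": algorithm.pipeline_number},  # placeholder results dict
            )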
.. py:class:: IterativeAlgorithm(X, y, problem_type, sampler_name=None, allowed_model_families=None, excluded_model_families=None, allowed_component_graphs=None, max_batches=None, max_iterations=None, tuner_class=None, random_seed=0, pipelines_per_batch=5, n_jobs=-1, number_features=None, ensembling=False, text_in_ensembling=False, search_parameters=None, _estimator_family_order=None, allow_long_running_models=False, features=None, verbose=False, exclude_featurizers=None)

   An automl algorithm which first fits a base round of pipelines with default parameters, then does a round of parameter tuning on each pipeline in order of performance.

   :param X: Training data.
   :type X: pd.DataFrame
   :param y: Target data.
   :type y: pd.Series
   :param problem_type: Problem type associated with training data.
   :type problem_type: ProblemType
   :param sampler_name: Sampler to use for preprocessing. Defaults to None.
   :type sampler_name: BaseSampler
   :param allowed_model_families: The model families to search. The default of None searches over all model families. Run evalml.pipelines.components.utils.allowed_model_families("binary") to see options. Change `binary` to `multiclass` or `regression` depending on the problem type. Note that if allowed_pipelines is provided, this parameter will be ignored.
   :type allowed_model_families: list(str, ModelFamily)
   :param excluded_model_families: A list of model families to exclude from the estimators used when building pipelines.
   :type excluded_model_families: list[ModelFamily]
   :param allowed_component_graphs: A dictionary of lists or ComponentGraphs indicating the component graphs allowed in the search. The format should follow { "Name_0": [list_of_components], "Name_1": [ComponentGraph(...)] }. The default of None indicates all pipeline component graphs for this problem type are allowed. Setting this field will cause allowed_model_families to be ignored. e.g. allowed_component_graphs = { "My_Graph": ["Imputer", "One Hot Encoder", "Random Forest Classifier"] }
   :type allowed_component_graphs: dict
   :param max_batches: The maximum number of batches to be evaluated. Used to determine ensembling. Defaults to None.
   :type max_batches: int
   :param max_iterations: The maximum number of iterations to be evaluated. Used to determine ensembling. Defaults to None.
   :type max_iterations: int
   :param tuner_class: A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
   :type tuner_class: class
   :param random_seed: Seed for the random number generator. Defaults to 0.
   :type random_seed: int
   :param pipelines_per_batch: The number of pipelines to be evaluated in each batch, after the first batch. Defaults to 5.
   :type pipelines_per_batch: int
   :param n_jobs: Non-negative integer describing level of parallelism used for pipelines, or -1 to use all available processors. Defaults to -1.
   :type n_jobs: int or None
   :param number_features: The number of columns in the input features. Defaults to None.
   :type number_features: int
   :param ensembling: If True, runs ensembling in a separate batch after every allowed pipeline class has been iterated over. Defaults to False.
   :type ensembling: boolean
   :param text_in_ensembling: If True and ensembling is True, then n_jobs will be set to 1 to avoid downstream sklearn stacking issues related to nltk. Defaults to False.
   :type text_in_ensembling: boolean
   :param search_parameters: Pipeline-level parameters and custom hyperparameter ranges specified for pipelines to iterate over. Hyperparameter ranges must be passed in as skopt.space objects. Defaults to None.
   :type search_parameters: dict or None
   :param _estimator_family_order: Specify the sort order for the first batch. Defaults to None, which uses _ESTIMATOR_FAMILY_ORDER.
   :type _estimator_family_order: list(ModelFamily) or None
   :param allow_long_running_models: Whether or not to allow longer-running models for large multiclass problems. If False and no pipelines, component graphs, or model families are provided, AutoMLSearch will not use Elastic Net or XGBoost when there are more than 75 multiclass targets and will not use CatBoost when there are more than 150 multiclass targets. Defaults to False.
   :type allow_long_running_models: bool
   :param features: List of features to run DFS on in AutoML pipelines. Defaults to None. Features will only be computed if the columns used by the feature exist in the input and if the feature itself is not already present in the input.
   :type features: list
   :param verbose: Whether or not to display logging information regarding pipeline building. Defaults to False.
   :type verbose: boolean
   :param exclude_featurizers: A list of featurizer components to exclude from the pipelines built by IterativeAlgorithm. Valid options are "DatetimeFeaturizer", "EmailFeaturizer", "URLFeaturizer", "NaturalLanguageFeaturizer", "TimeSeriesFeaturizer".
   :type exclude_featurizers: list[str]

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.automl.automl_algorithm.IterativeAlgorithm.add_result
      evalml.automl.automl_algorithm.IterativeAlgorithm.batch_number
      evalml.automl.automl_algorithm.IterativeAlgorithm.default_max_batches
      evalml.automl.automl_algorithm.IterativeAlgorithm.next_batch
      evalml.automl.automl_algorithm.IterativeAlgorithm.num_pipelines_per_batch
      evalml.automl.automl_algorithm.IterativeAlgorithm.pipeline_number

   .. py:method:: add_result(self, score_to_minimize, pipeline, trained_pipeline_results, cached_data=None)

      Register results from evaluating a pipeline.

      :param score_to_minimize: The score obtained by this pipeline on the primary objective, converted so that lower values indicate better pipelines.
      :type score_to_minimize: float
      :param pipeline: The trained pipeline object which was used to compute the score.
      :type pipeline: PipelineBase
      :param trained_pipeline_results: Results from training a pipeline.
      :type trained_pipeline_results: dict
      :param cached_data: A dictionary of cached data, where the keys are the model family. Expected to be of format {model_family: {hash1: trained_component_graph, hash2: trained_component_graph...}...}. Defaults to None.
      :type cached_data: dict

      :raises ValueError: If default parameters are not in the acceptable hyperparameter ranges.

   .. py:method:: batch_number(self)
      :property:

      Returns the number of batches which have been recommended so far.

   .. py:method:: default_max_batches(self)
      :property:

      Returns the number of max batches AutoMLSearch should run by default.

   .. py:method:: next_batch(self)

      Get the next batch of pipelines to evaluate.

      :returns: A list of instances of PipelineBase subclasses, ready to be trained and evaluated.
      :rtype: list[PipelineBase]

      :raises AutoMLAlgorithmException: If no results were reported from the first batch.

   .. py:method:: num_pipelines_per_batch(self, batch_number)

      Return the number of pipelines in the nth batch.

      :param batch_number: which batch to calculate the number of pipelines for.
      :type batch_number: int

      :returns: number of pipelines in the given batch.
      :rtype: int

   .. py:method:: pipeline_number(self)
      :property:

      Returns the number of pipelines which have been recommended so far.
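**Example**

A sketch of ``IterativeAlgorithm`` with a single named component graph and a custom hyperparameter range, again driven by hand on toy data. The graph name, the toy frame, the objective, and the placeholder results dictionary are illustrative assumptions; as documented above, hyperparameter ranges in ``search_parameters`` must be skopt.space objects.

.. code-block:: python

    import pandas as pd
    from skopt.space import Integer

    from evalml.automl.automl_algorithm import IterativeAlgorithm

    X = pd.DataFrame({"a": range(100), "b": range(100, 200)})
    y = pd.Series([0, 1] * 50)

    algorithm = IterativeAlgorithm(
        X=X,
        y=y,
        problem_type="binary",
        allowed_component_graphs={
            "My_Graph": ["Imputer", "One Hot Encoder", "Random Forest Classifier"],
        },
        search_parameters={
            "Random Forest Classifier": {"n_estimators": Integer(50, 200)},
        },
        pipelines_per_batch=3,
        random_seed=0,
    )

    # The first batch holds one default-parameter pipeline per allowed component
    # graph. Results must be reported back before asking for the next batch;
    # otherwise next_batch raises AutoMLAlgorithmException.
    for pipeline in algorithm.next_batch():
        pipeline.fit(X, y)
        scores = pipeline.score(X, y, objectives=["Log Loss Binary"])
        algorithm.add_result(
            scores["Log Loss Binary"],
            pipeline,
            {"id": algorithm.pipeline_number},  # placeholder results dict
        )

    # Subsequent batches tune the best-performing pipelines seen so far.
    tuned_batch = algorithm.next_batch()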