{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Automated Machine Learning (AutoML) Search\n", "\n", "## Background\n", "\n", "### Machine Learning\n", "\n", "[Machine learning](https://en.wikipedia.org/wiki/Machine_learning) (ML) is the process of constructing a mathematical model of a system based on a sample dataset collected from that system.\n", "\n", "One of the main goals of training an ML model is to teach the model to separate the signal present in the data from the noise inherent in system and in the data collection process. If this is done effectively, the model can then be used to make accurate predictions about the system when presented with new, similar data. Additionally, introspecting on an ML model can reveal key information about the system being modeled, such as which inputs and transformations of the inputs are most useful to the ML model for learning the signal in the data, and are therefore the most predictive.\n", "\n", "There are [a variety](https://en.wikipedia.org/wiki/Machine_learning#Approaches) of ML problem types. Supervised learning describes the case where the collected data contains an output value to be modeled and a set of inputs with which to train the model. EvalML focuses on training supervised learning models.\n", "\n", "EvalML supports three common supervised ML problem types. The first is regression, where the target value to model is a continuous numeric value. Next are binary and multiclass classification, where the target value to model consists of two or more discrete values or categories. The choice of which supervised ML problem type is most appropriate depends on domain expertise and on how the model will be evaluated and used. \n", "\n", "EvalML is currently building support for supervised time series problems: time series regression, time series binary classification, and time series multiclass classification. While we've added some features to tackle these kinds of problems, our functionality is still being actively developed so please be mindful of that before using it. \n", "\n", "\n", "### AutoML and Search\n", "\n", "[AutoML](https://en.wikipedia.org/wiki/Automated_machine_learning) is the process of automating the construction, training and evaluation of ML models. Given a data and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the dataset. During the search, AutoML will explore different combinations of model type, model parameters and model architecture.\n", "\n", "An effective AutoML solution offers several advantages over constructing and tuning ML models by hand. AutoML can assist with many of the difficult aspects of ML, such as avoiding overfitting and underfitting, imbalanced data, detecting data leakage and other potential issues with the problem setup, and automatically applying best-practice data cleaning, feature engineering, feature selection and various modeling techniques. AutoML can also leverage search algorithms to optimally sweep the hyperparameter search space, resulting in model performance which would be difficult to achieve by manual training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AutoML in EvalML\n", "\n", "EvalML supports all of the above and more.\n", "\n", "In its simplest usage, the AutoML search interface requires only the input data, the target data and a `problem_type` specifying what kind of supervised ML problem to model.\n", "\n", "** Graphing methods, like verbose AutoMLSearch, on Jupyter Notebook and Jupyter Lab require [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/user_install.html) to be installed.\n", "\n", "** If graphing on Jupyter Lab, [jupyterlab-plotly](https://plotly.com/python/getting-started/#jupyterlab-support-python-35) required. To download this, make sure you have [npm](https://nodejs.org/en/download/) installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import evalml\n", "from evalml.utils import infer_feature_types\n", "\n", "X, y = evalml.demos.load_fraud(n_rows=650)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To provide data to EvalML, it is recommended that you initialize a [Woodwork accessor](https://woodwork.alteryx.com/en/stable/) on your data. This allows you to easily control how EvalML will treat each of your features before training a model.\n", "\n", "EvalML also accepts ``pandas`` input, and will run type inference on top of the input ``pandas`` data. If you'd like to change the types inferred by EvalML, you can use the `infer_feature_types` utility method, which takes pandas or numpy input and converts it to a Woodwork data structure. The `feature_types` parameter can be used to specify what types specific columns should be.\n", "\n", "Feature types such as `Natural Language` must be specified in this way, otherwise Woodwork will infer it as `Unknown` type and drop it during the AutoMLSearch.\n", "\n", "In the example below, we reformat a couple features to make them easily consumable by the model, and then specify that the provider, which would have otherwise been inferred as a column with natural language, is a categorical column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.ww[\"expiration_date\"] = X[\"expiration_date\"].apply(\n", " lambda x: \"20{}-01-{}\".format(x.split(\"/\")[1], x.split(\"/\")[0])\n", ")\n", "X = infer_feature_types(\n", " X,\n", " feature_types={\n", " \"store_id\": \"categorical\",\n", " \"expiration_date\": \"datetime\",\n", " \"lat\": \"categorical\",\n", " \"lng\": \"categorical\",\n", " \"provider\": \"categorical\",\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(\n", " X, y, problem_type=\"binary\", test_size=0.2\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Checks\n", "\n", "Before calling `AutoMLSearch.search`, we should run some sanity checks on our data to ensure that the input data being passed will not run into some common issues before running a potentially time-consuming search. EvalML has various data checks that makes this easy. Each data check will return a collection of warnings and errors if it detects potential issues with the input data. This allows users to inspect their data to avoid confusing errors that may arise during the search process. You can learn about each of the data checks available through our [data checks guide](data_checks.ipynb). \n", "\n", "Here, we will run the `DefaultDataChecks` class, which contains a series of data checks that are generally useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.data_checks import DefaultDataChecks\n", "\n", "data_checks = DefaultDataChecks(\"binary\", \"log loss binary\")\n", "data_checks.validate(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there were no warnings or errors returned, we can safely continue with the search process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Holdout Set for Pipeline Ranking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the `holdout_set_size` parameter is set and the input dataset has more than 500 rows, AutoMLSearch will create a holdout set from `holdout_set_size` of the training data. Alternatively, a holdout set can be manually specified by using the `X_holdout` and `y_holdout` parameters in `AutoMLSearch()`. In this example, the holdout set created previously will be used by AutoML search.\n", "\n", "During the AutoML search process, the mean of the objective scores of all cross validation folds (shown the \"mean_cv_score\" column in the pipeline rankings), is calculated. This score is passed to the AutoML search tuner to further optimize the hyperparameters of the next batch of pipelines.\n", "\n", "After, the pipeline will be fitted on the entire training dataset and scored on this new holdout set. This score is represented under the \"ranking_score\" column on the pipeline rankings board and is used to rank pipeline performance.\n", "\n", "If a dataset has less than 500 rows or `holdout_set_size=0` (which is the default setting), the \"mean_cv_score\" will be used as the ranking_score instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl = evalml.automl.AutoMLSearch(\n", " X_train=X_train,\n", " y_train=y_train,\n", " X_holdout=X_holdout,\n", " y_holdout=y_holdout,\n", " problem_type=\"binary\",\n", " verbose=True,\n", ")\n", "automl.search(interactive_plot=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the `verbose` argument set to True, the AutoML search will log its progress, reporting each pipeline and parameter set evaluated during the search. The search iteration plot shown during AutoML search tracks the current pipeline's validation score (tracked as the gray point) against the best pipeline validation score (tracked as the blue line).\n", "\n", "There are a number of mechanisms to control the AutoML search time. One way is to set the `max_batches` parameter which controls the maximum number of rounds of AutoML to evaluate, where each round may train and score a variable number of pipelines. Another way is to set the `max_iterations` parameter which controls the maximum number of candidate models to be evaluated during AutoML. By default, AutoML will search for a single batch. The first pipeline to be evaluated will always be a baseline model representing a trivial solution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The AutoML interface supports a variety of other parameters. For a comprehensive list, please [refer to the API reference.](../autoapi/evalml/automl/index.rst#evalml.automl.AutoMLSearch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also provide [a standalone search method](../autoapi/evalml/automl/index.rst#evalml.automl.search) which does all of the above in a single line, and returns the `AutoMLSearch` instance and data check results. If there were data check errors, AutoML will not be run and no `AutoMLSearch` instance will be returned." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detecting Problem Type\n", "\n", "EvalML includes a simple method, `detect_problem_type`, to help determine the problem type given the target data. \n", "\n", "This function can return the predicted problem type as a ProblemType enum, choosing from ProblemType.BINARY, ProblemType.MULTICLASS, and ProblemType.REGRESSION. If the target data is invalid (for instance when there is only 1 unique label), the function will throw an error instead.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from evalml.problem_types import detect_problem_type\n", "\n", "y_binary = pd.Series([0, 1, 1, 0, 1, 1])\n", "detect_problem_type(y_binary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Objective parameter\n", "\n", "AutoMLSearch takes in an `objective` parameter to determine which `objective` to optimize for. By default, this parameter is set to `auto`, which allows AutoML to choose `LogLossBinary` for binary classification problems, `LogLossMulticlass` for multiclass classification problems, and `R2` for regression problems.\n", "\n", "It should be noted that the `objective` parameter is only used in ranking and helping choose the pipelines to iterate over, but is not used to optimize each individual pipeline during fit-time.\n", "\n", "To get the default objective for each problem type, you can use the `get_default_primary_search_objective` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.automl import get_default_primary_search_objective\n", "\n", "binary_objective = get_default_primary_search_objective(\"binary\")\n", "multiclass_objective = get_default_primary_search_objective(\"multiclass\")\n", "regression_objective = get_default_primary_search_objective(\"regression\")\n", "\n", "print(binary_objective.name)\n", "print(multiclass_objective.name)\n", "print(regression_objective.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using custom pipelines\n", "\n", "EvalML's AutoML algorithm generates a set of pipelines to search with. To provide a custom set instead, set allowed_component_graphs to a dictionary of custom component graphs. `AutoMLSearch` will use these to generate `Pipeline` instances. Note: this will prevent AutoML from generating other pipelines to search over." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines import MulticlassClassificationPipeline\n", "\n", "\n", "automl_custom = evalml.automl.AutoMLSearch(\n", " X_train=X_train,\n", " y_train=y_train,\n", " problem_type=\"multiclass\",\n", " verbose=True,\n", " allowed_component_graphs={\n", " \"My_pipeline\": [\"Simple Imputer\", \"Random Forest Classifier\"],\n", " \"My_other_pipeline\": [\"One Hot Encoder\", \"Random Forest Classifier\"],\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stopping the search early\n", "\n", "To stop the search early, hit `Ctrl-C`. This will bring up a prompt asking for confirmation. Responding with `y` will immediately stop the search. Responding with `n` will continue the search.\n", "\n", "![Interrupting Search Demo](keyboard_interrupt_demo_updated.gif)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Callback functions\n", "\n", "``AutoMLSearch`` supports several callback functions, which can be specified as parameters when initializing an ``AutoMLSearch`` object. They are:\n", "\n", "- ``start_iteration_callback``\n", "- ``add_result_callback``\n", "- ``error_callback``\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Start Iteration Callback\n", "Users can set ``start_iteration_callback`` to set what function is called before each pipeline training iteration. This callback function must take three positional parameters: the pipeline class, the pipeline parameters, and the ``AutoMLSearch`` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## start_iteration_callback example function\n", "def start_iteration_callback_example(pipeline_class, pipeline_params, automl_obj):\n", " print(\"Training pipeline with the following parameters:\", pipeline_params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Add Result Callback\n", "Users can set ``add_result_callback`` to set what function is called after each pipeline training iteration. This callback function must take three positional parameters: a dictionary containing the training results for the new pipeline, an untrained_pipeline containing the parameters used during training, and the ``AutoMLSearch`` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## add_result_callback example function\n", "def add_result_callback_example(pipeline_results_dict, untrained_pipeline, automl_obj):\n", " print(\n", " \"Results for trained pipeline with the following parameters:\",\n", " pipeline_results_dict,\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Error Callback\n", "Users can set the ``error_callback`` to set what function called when `search()` errors and raises an ``Exception``. This callback function takes three positional parameters: the ``Exception raised``, the traceback, and the ``AutoMLSearch object``. This callback function must also accept ``kwargs``, so ``AutoMLSearch`` is able to pass along other parameters used by default.\n", "\n", "Evalml defines several error callback functions, which can be found under `evalml.automl.callbacks`. They are:\n", "\n", "- `silent_error_callback`\n", "- `raise_error_callback`\n", "- `log_and_save_error_callback`\n", "- `raise_and_save_error_callback`\n", "- `log_error_callback` (default used when ``error_callback`` is None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# error_callback example; this is implemented in the evalml library\n", "def raise_error_callback(exception, traceback, automl, **kwargs):\n", " \"\"\"Raises the exception thrown by the AutoMLSearch object. Also logs the exception as an error.\"\"\"\n", " logger.error(f\"AutoMLSearch raised a fatal exception: {str(exception)}\")\n", " logger.error(\"\\n\".join(traceback))\n", " raise exception" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View Rankings\n", "A summary of all the pipelines built can be returned as a pandas DataFrame which is sorted by the validation score.\n", "\n", "- For AutoML searches completed with a holdout set, the validation score is the holdout score of the pipeline fitted using the entire training dataset. \n", "- For AutoML searches completed without a holdout set, the validation score is the average score across all cross-validation folds. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl.rankings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recommendation Score\n", "\n", "If you would like a more robust evaluation of the performance of your models, EvalML additionally provides a recommendation score alongside the selected objective. The recommendation score is a weighted average of a number of default objectives for your problem type, normalized and scaled so that the final score can be interpreted as a percentage from 0 to 100. This weighted score provides a more holistic understanding of model performance, and prioritizes model generalizability rather than one single objective which may not completely serve your use case." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl.get_recommendation_scores(use_pipeline_names=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl.get_recommendation_scores(priority=\"F1\", use_pipeline_names=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see what objectives are included in the recommendation score, you can use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evalml.objectives.get_default_recommendation_objectives(\"binary\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would like to automatically rank your pipelines by this recommendation score, you can set `use_recommendation=True` when initializing `AutoMLSearch`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_recommendation = evalml.automl.AutoMLSearch(\n", " X_train=X_train,\n", " y_train=y_train,\n", " X_holdout=X_holdout,\n", " y_holdout=y_holdout,\n", " problem_type=\"binary\",\n", " use_recommendation=True,\n", ")\n", "automl_recommendation.search(interactive_plot=False)\n", "\n", "automl_recommendation.rankings[\n", " [\n", " \"id\",\n", " \"pipeline_name\",\n", " \"search_order\",\n", " \"recommendation_score\",\n", " \"holdout_score\",\n", " \"mean_cv_score\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a helper function on the `AutoMLSearch` object to help you understand how the recommendation score was calculated. It displays the raw scores of the objectives included within the score calculation. Here, we take a look at pipeline with `id=9`, the Decision Tree pipeline:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_recommendation.get_recommendation_score_breakdown(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Describe Pipeline\n", "Each pipeline is given an `id`. We can get more information about any particular pipeline using that `id`. Here, we will get more information about the pipeline with `id = 1`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl.describe_pipeline(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Pipeline\n", "We can get the object of any pipeline via their `id` as well:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline = automl.get_pipeline(1)\n", "print(pipeline.name)\n", "print(pipeline.parameters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get best pipeline\n", "If you specifically want to get the best pipeline, there is a convenient accessor for that.\n", "The pipeline returned is already fitted on the input X, y data that we passed to AutoMLSearch. To turn off this default behavior, set `train_best_pipeline=False` when initializing AutoMLSearch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_pipeline = automl.best_pipeline\n", "print(best_pipeline.name)\n", "print(best_pipeline.parameters)\n", "best_pipeline.predict(X_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training and Scoring Multiple Pipelines using AutoMLSearch\n", "\n", "AutoMLSearch will automatically fit the best pipeline on the entire training data. It also provides an easy API for training and scoring other pipelines.\n", "\n", "If you'd like to train one or more pipelines on the entire training data, you can use the `train_pipelines` method.\n", "\n", "Similarly, if you'd like to score one or more pipelines on a particular dataset, you can use the `score_pipelines` method.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trained_pipelines = automl.train_pipelines([automl.get_pipeline(i) for i in [0, 1, 2]])\n", "trained_pipelines" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_holdout_scores = automl.score_pipelines(\n", " [trained_pipelines[name] for name in trained_pipelines.keys()],\n", " X_holdout,\n", " y_holdout,\n", " [\"Accuracy Binary\", \"F1\", \"AUC\"],\n", ")\n", "pipeline_holdout_scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving AutoMLSearch and pipelines from AutoMLSearch\n", "\n", "There are two ways to save results from AutoMLSearch. \n", "\n", "- You can save the AutoMLSearch object itself, calling `.save()` to do so. This will allow you to save the AutoMLSearch state and reload all pipelines from this.\n", "\n", "- If you want to save a pipeline from AutoMLSearch for future use, pipeline classes themselves have a `.save()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# saving the entire automl search\n", "automl.save(\"automl.cloudpickle\")\n", "automl2 = evalml.automl.AutoMLSearch.load(\"automl.cloudpickle\")\n", "# saving the best pipeline using .save()\n", "best_pipeline.save(\"pipeline.cloudpickle\")\n", "best_pipeline_copy = evalml.pipelines.PipelineBase.load(\"pipeline.cloudpickle\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limiting the AutoML Search Space\n", "The AutoML search algorithm first trains each component in the pipeline with their default values. After the first iteration, it then tweaks the parameters of these components using the pre-defined hyperparameter ranges that these components have. To limit the search over certain hyperparameter ranges, you can specify a `search_parameters` argument with your `AutoMLSearch` parameters. These parameters will limit the hyperparameter search space or pipeline parameter space. \n", "\n", "Hyperparameter ranges can be found through the [API reference](https://evalml.alteryx.com/en/stable/api_index.html) for each component. Parameter arguments must be specified as dictionaries, but the associated values must be `skopt.space` Real, Integer, Categorical objects for setting hyperparameter ranges.\n", "\n", "If however you'd like to specify certain values for the initial batch of the AutoML search algorithm, you can use the `search_parameters` argument with non `skopt.space` objects. This will set the initial batch's component parameters to the values passed by this argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml import AutoMLSearch\n", "from evalml.demos import load_fraud\n", "from skopt.space import Categorical\n", "from evalml.model_family import ModelFamily\n", "import woodwork as ww\n", "\n", "X, y = load_fraud(n_rows=1000)\n", "\n", "# example of setting parameter to just one value\n", "search_parameters = {\"Imputer\": {\"numeric_impute_strategy\": \"mean\"}}\n", "\n", "\n", "# limit the numeric impute strategy to include only `median` and `most_frequent`\n", "# `mean` is the default value for this argument, but it doesn't need to be included in the specified hyperparameter range for this to work\n", "search_parameters = {\n", " \"Imputer\": {\"numeric_impute_strategy\": Categorical([\"median\", \"most_frequent\"])}\n", "}\n", "\n", "# using this custom hyperparameter means that our Imputer components in these pipelines will only search through\n", "# 'median' and 'most_frequent' strategies for 'numeric_impute_strategy'\n", "automl_constrained = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " search_parameters=search_parameters,\n", " verbose=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `skopt.space` Integer, Real, or Categorical will set the hyperparameter space explored during search. All other values will set the pipeline parameters directly. Setting pipeline parameters directly defines the initialization parameters that a pipeline starts with during the first batch of AutoMLSearch. the hyperparameter range then defines the space of possible new parameter values, which the tuner chooses.\n", "\n", "Let's walk through some examples to explain this. For instance,\n", "```python\n", "search_parameters = {'Imputer': {\n", " 'numeric_impute_strategy': 'mean'\n", "}}\n", "```\n", "then in the initial search, the algorithm would use `mean` as the impute strategy in batch 1. However, since `Imputer.numeric_impute_strategy` has a valid hyperparameter range, if the algorithm suggests a different strategy, it can and will change this value. To limit this to using `mean` only for the duration of the search, it is necessary to use the `skopt.space`:\n", "```python\n", "search_parameters = {'Imputer': {\n", " 'numeric_impute_strategy': Categorical(['mean'])\n", "}}\n", "```\n", "\n", "However, if a value has no hyperparameter range associated, then the algorithm will use this value as the only parameter. For instance,\n", "```python\n", "search_parameters = {'Label Encoder': {\n", " 'positive_label': True\n", "}}\n", "```\n", "Since `Label Encoder.positive_label` has no associated hyperparameter range, the algorithm will use this parameter for the entire duration of the search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imbalanced Data\n", "The AutoML search algorithm now has functionality to handle imbalanced data during classification! AutoMLSearch now provides two additional parameters, `sampler_method` and `sampler_balanced_ratio`, that allow you to let AutoMLSearch know whether to sample imbalanced data, and how to do so. `sampler_method` takes in either `Undersampler`, `Oversampler`, `auto`, or None as the sampler to use, and `sampler_balanced_ratio` specifies the `minority/majority` ratio that you want to sample to. Details on the Undersampler and Oversampler components can be found in the [documentation](https://evalml.alteryx.com/en/stable/api_index.html#transformers).\n", "\n", "This can be used for imbalanced datasets, like the fraud dataset, which has a 'minority:majority' ratio of < 0.2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_auto = AutoMLSearch(\n", " X_train=X, y_train=y, problem_type=\"binary\", automl_algorithm=\"iterative\"\n", ")\n", "automl_auto.allowed_pipelines[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Oversampler is chosen as the default sampling component here, since the `sampler_balanced_ratio = 0.25`. If you specified a lower ratio, for instance `sampler_balanced_ratio = 0.1`, then there would be no sampling component added here. This is because if a ratio of 0.1 would be considered balanced, then a ratio of 0.2 would also be balanced.\n", "\n", "The Oversampler uses SMOTE under the hood, and automatically selects whether to use SMOTE, SMOTEN, or SMOTENC based on the data it receives." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_auto_ratio = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " sampler_balanced_ratio=0.1,\n", " automl_algorithm=\"iterative\",\n", ")\n", "automl_auto_ratio.allowed_pipelines[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, you can add more fine-grained sampling ratios by passing in a `sampling_ratio_dict` in pipeline parameters. For this dictionary, AutoMLSearch expects the keys to be int values from 0 to `n-1` for the classes, and the values would be the `sampler_balanced__ratio` associated with each target. This dictionary would override the AutoML argument `sampler_balanced_ratio`. Below, you can see the scenario for Oversampler component on this dataset. Note that the logic for Undersamplers is included in the commented section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# In this case, the majority class is the negative class\n", "# for the oversampler, we don't want to oversample this class, so class 0 (majority) will have a ratio of 1 to itself\n", "# for the minority class 1, we want to oversample it to have a minority/majority ratio of 0.5, which means we want minority to have 1/2 the samples as the minority\n", "sampler_ratio_dict = {0: 1, 1: 0.5}\n", "search_parameters = {\"Oversampler\": {\"sampler_balanced_ratio\": sampler_ratio_dict}}\n", "automl_auto_ratio_dict = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " search_parameters=search_parameters,\n", " automl_algorithm=\"iterative\",\n", ")\n", "automl_auto_ratio_dict.allowed_pipelines[-1]\n", "\n", "# Undersampler case\n", "# we don't want to undersample this class, so class 1 (minority) will have a ratio of 1 to itself\n", "# for the majority class 0, we want to undersample it to have a minority/majority ratio of 0.5, which means we want majority to have 2x the samples as the minority\n", "# sampler_ratio_dict = {0: 0.5, 1: 1}\n", "# search_parameters = {\"Oversampler\": {\"sampler_balanced_ratio\": sampler_ratio_dict}}\n", "# automl_auto_ratio_dict = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', search_parameters=search_parameters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding ensemble methods to AutoML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stacking\n", "[Stacking](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking) is an ensemble machine learning algorithm that involves training a model to best combine the predictions of several base learning algorithms. First, each base learning algorithms is trained using the given data. Then, the combining algorithm or meta-learner is trained on the predictions made by those base learning algorithms to make a final prediction.\n", "\n", "AutoML enables stacking using the `ensembling` flag during initalization; this is set to `False` by default. How ensembling runs is defined by the AutoML algorithm you choose. In the `IterativeAlgorithm`, the stacking ensemble pipeline runs in its own batch after a whole cycle of training has occurred (each allowed pipeline trains for one batch). Note that this means __a large number of iterations may need to run before the stacking ensemble runs__. It is also important to note that __only the first CV fold is calculated for stacking ensembles__ because the model internally uses CV folds. See below in the AutoML Algorithms section to see how ensembling is run for `DefaultAlgorithm`. Please do note that ensembling is currently unavailable for time series problems." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = evalml.demos.load_breast_cancer()\n", "\n", "automl_with_ensembling = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " allowed_model_families=[ModelFamily.LINEAR_MODEL],\n", " max_batches=4,\n", " ensembling=True,\n", " automl_algorithm=\"iterative\",\n", " verbose=True,\n", ")\n", "automl_with_ensembling.search(interactive_plot=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can view more information about the stacking ensemble pipeline (which was the best performing pipeline) by calling `.describe()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_with_ensembling.best_pipeline.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AutoML Algorithms\n", "\n", "EvalML currently has two algorithms available for users to choose from. Below, we will run through how each algorithm works and how to access them through `AutoMLSearch` as well as the top level search methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [IterativeAlgorithm](https://evalml.alteryx.com/en/stable/autoapi/evalml/automl/automl_algorithm/iterative_algorithm/index.html)\n", "\n", "`IterativeAlgorithm` is the first AutoML Algorithm created in EvalML and can be acessed with the `search_iterative` method or specifiying `AutoMLSearch(automl_algorithm='iterative')`. The algorithm works as follows:\n", "\n", "- Every batch (after the initial baseline model) contains pipelines of all available estimators for the specified problem type\n", "- Pipelines contain preprocessing (imputing, encoding, etc.) needed for machine learning but no feature selection is applied\n", "- Ensembling can be turned on by passing in the `ensembling=True` parameter and will be run after a whole cycle of training has occurred (each allowed pipeline trains for one batch)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import evalml\n", "\n", "X, y = evalml.demos.load_fraud(n_rows=250)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.automl import search_iterative\n", "\n", "# top level search method will run `AutoMLSearch` with `IterativeAlgorithm` as well as apply our default data checks\n", "auto_iterative, messages_iterative = search_iterative(X, y, problem_type=\"binary\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml import AutoMLSearch\n", "\n", "auto_iterative = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " automl_algorithm=\"iterative\",\n", " verbose=True,\n", ")\n", "auto_iterative.search(interactive_plot=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [DefaultAlgorithm](https://evalml.alteryx.com/en/stable/autoapi/evalml/automl/automl_algorithm/default_algorithm/index.html)\n", "\n", "`DefaultAlgorithm` was designed to do three main things:\n", "\n", "1. Abstract out more parameters and decisions from the user.\n", "\n", "2. Perform deeper tuning for high performing pipelines.\n", "\n", "3. Create a platform to introduce feature selection as well as other potential techniques/heuristics for AutoML.\n", "\n", "`DefaultAlgorithm` does this by creating the concept of two modes: `fast` and `long`, where `fast` is a subset of long. The algorithm runs as follows:\n", "\n", "1. Run naive pipelines:\n", " a. a random forest pipeline with the default preprocessing pipeline\n", " \n", "2. Run the same pipelines, this time with feature selection. Subsequent pipelines will use the selected features with a SelectedColumns transformer.\n", "\n", "3. Run all pipelines with preprocessing components:\n", " a. scan rest of estimators (`IterativeAlgorithm` batch 1).\n", " \n", "4. First ensembling run\n", "\n", "Fast mode ends here. Begin long mode.\n", "\n", "6. Run top 3 estimators:\n", " a. Generate 50 random parameter sets. Run all 150 in one batch\n", " \n", "7. Second ensembling run\n", "\n", "8. Repeat 8a and 8b indefinitely until the specified time in `AutoMLSearch` is met:\n", " a. For each of the previous top 3 estimators, sample 10 parameters from the tuner. Run all 30 in one batch\n", " b. Run ensembling\n", " \n", "To this end, it is recommended to use the top level `search()` method to run `DefaultAlgorithm`. This allows users to specify running search with just the `mode` parameter, where `fast` is recommended for users who want a fast scan at how EvalML pipelines will perform on their problem and where `long` is reserved for a deeper dive into high performing pipelines. If one needs finer control over AutoML parameters, one can also specify `automl_algorithm='default'` using `AutoMLSearch` and it will default to using `fast` mode. However, in this case ensembling will be defined by the `ensembling` flag (if `ensembling=False` the abovementioned ensembling batches will be skipped). Users are welcome to select `max_batches` according to the algorithm above (or other stopping criteria) but should be aware that results may not be optimal if the algorithm does not run for the full length of `fast` mode. Note that the `allowed_model_families` and `excluded_model_families` parameters are only applied to the non-naive batches in the default algorithms. If users want to apply these to all estimators, use the iterative algorithm by specifying `automl_algorithm='iterative'`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.automl import search\n", "\n", "# top level search method will run `AutoMLSearch` with `IterativeAlgorithm` as well as apply our default data checks\n", "auto_default, messages_default = search(X, y, problem_type=\"binary\", mode=\"fast\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml import AutoMLSearch\n", "\n", "auto_default = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " automl_algorithm=\"default\",\n", " ensembling=True,\n", " verbose=True,\n", ")\n", "auto_default.search(interactive_plot=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline differences\n", "Through the search output above, we can see how pipelines differ between `IterativeAlgorithm` and `DefaultAlgorithm`. This is because `DefaultAlgorithm` utilizes new components such as `RFRegressorSelectFromModel` and other column selectors for feature selection as well as a new pipeline structure to handle feature selection for categorical and non-categorical features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto_iterative.get_pipeline(4).graph()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto_default.get_pipeline(6).graph()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Access raw results\n", "\n", "The `AutoMLSearch` class records detailed results information under the `results` field, including information about the cross-validation scoring and parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pprint\n", "\n", "pp = pprint.PrettyPrinter(indent=0, width=100, depth=3, compact=True, sort_dicts=False)\n", "\n", "pp.pprint(automl.results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there are errors, such as with the Iterative Algorithm example above, we can examine these closer by accessing the `errors` field. There is one dictionary entry per pipeline fold that failed, and each entry contains the pipeline parameters with the error that was thrown and its full traceback." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto_iterative.errors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallel AutoML\n", "\n", "By default, all pipelines in an AutoML batch are evaluated in series. Pipelines can be evaluated in parallel to improve performance during AutoML search. This is accomplished by a futures style submission and evaluation of pipelines in a batch. As of this writing, the pipelines use a threaded model for concurrent evaluation. This is similar to the currently implemented `n_jobs` parameter in the estimators, which uses increased numbers of threads to train and evaluate estimators." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick Start\n", "\n", "To quickly use some parallelism to enhance the pipeline searching, a string can be passed through to AutoMLSearch during initialization to setup the parallel engine and client within the AutoMLSearch object. The current options are \"cf_threaded\", \"cf_process\", \"dask_threaded\" and \"dask_process\" and indicate the futures backend to use and whether to use threaded- or process-level parallelism." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "automl_cf_threaded = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " allowed_model_families=[ModelFamily.LINEAR_MODEL],\n", " engine=\"cf_threaded\",\n", ")\n", "automl_cf_threaded.search(interactive_plot=False)\n", "automl_cf_threaded.close_engine()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parallelism with Concurrent Futures\n", "\n", "The `EngineBase` class is robust and extensible enough to support futures-like implementations from a variety of libraries. The `CFEngine` extends the `EngineBase` to use the native Python [concurrent.futures library](https://docs.python.org/3/library/concurrent.futures.html). The `CFEngine` supports both thread- and process-level parallelism. The type of parallelism can be chosen using either the `ThreadPoolExecutor` or the `ProcessPoolExecutor`. If either executor is passed a `max_workers` parameter, it will set the number of processes and threads spawned. If not, the default number of processes will be equal to the number of processors available and the number of threads set to five times the number of processors available.\n", "\n", "Here, the CFEngine is invoked with default parameters, which is threaded parallelism using all available threads." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from concurrent.futures import ThreadPoolExecutor\n", "\n", "from evalml.automl.engine.cf_engine import CFEngine, CFClient\n", "\n", "cf_engine = CFEngine(CFClient(ThreadPoolExecutor(max_workers=4)))\n", "automl_cf_threaded = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " allowed_model_families=[ModelFamily.LINEAR_MODEL],\n", " engine=cf_engine,\n", ")\n", "automl_cf_threaded.search(interactive_plot=False)\n", "automl_cf_threaded.close_engine()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: the cell demonstrating process-level parallelism is a markdown due to incompatibility with our ReadTheDocs build. It can be run successfully locally.\n", "\n", "```python\n", "from concurrent.futures import ProcessPoolExecutor\n", "\n", "# Repeat the process but using process-level parallelism\\\n", "cf_engine = CFEngine(CFClient(ProcessPoolExecutor(max_workers=2)))\n", "automl_cf_process = AutoMLSearch(X_train=X, y_train=y,\n", " problem_type=\"binary\",\n", " engine=\"cf_process\")\n", "automl_cf_process.search(interactive_plot = False)\n", "automl_cf_process.close_engine()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parallelism with Dask\n", "\n", "Thread or process level parallelism can be explicitly invoked for the `DaskEngine` (as well as the `CFEngine`). The `processes` can be set to `True` and the number of processes set using `n_workers`. If `processes` is set to `False`, then the resulting parallelism will be threaded and `n_workers` will represent the threads used. Examples of both follow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import LocalCluster\n", "\n", "from evalml.automl.engine import DaskEngine\n", "\n", "dask_engine_p2 = DaskEngine(cluster=LocalCluster(processes=True, n_workers=2))\n", "automl_dask_p2 = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " allowed_model_families=[ModelFamily.LINEAR_MODEL],\n", " engine=dask_engine_p2,\n", ")\n", "automl_dask_p2.search(interactive_plot=False)\n", "\n", "# Explicitly shutdown the automl object's LocalCluster\n", "automl_dask_p2.close_engine()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dask_engine_t4 = DaskEngine(cluster=LocalCluster(processes=False, n_workers=4))\n", "\n", "automl_dask_t4 = AutoMLSearch(\n", " X_train=X,\n", " y_train=y,\n", " problem_type=\"binary\",\n", " allowed_model_families=[ModelFamily.LINEAR_MODEL],\n", " engine=dask_engine_t4,\n", ")\n", "automl_dask_t4.search(interactive_plot=False)\n", "automl_dask_t4.close_engine()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, a significant performance gain can result from simply using something other than the default `SequentialEngine`, ranging from a 100% speed up with multiple processes to 500% speedup with multiple threads!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Sequential search duration: %s\" % str(automl.search_duration))\n", "print(\n", " \"Concurrent futures (threaded) search duration: %s\"\n", " % str(automl_cf_threaded.search_duration)\n", ")\n", "print(\"Dask (two processes) search duration: %s\" % str(automl_dask_p2.search_duration))\n", "print(\"Dask (four threads)search duration: %s\" % str(automl_dask_t4.search_duration))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" }, "vscode": { "interpreter": { "hash": "fb5a3afe2d0dd7ad0fd9e1ff89de4e0e95804490c629f36065bf8d930a66d311" } } }, "nbformat": 4, "nbformat_minor": 4 }