Automated Machine Learning (AutoML) Search

Background

Machine Learning

Machine learning (ML) is the process of constructing a mathematical model of a system based on a sample dataset collected from that system.

One of the main goals of training an ML model is to teach the model to separate the signal present in the data from the noise inherent in the system and in the data collection process. If this is done effectively, the model can then be used to make accurate predictions about the system when presented with new, similar data. Additionally, introspecting on an ML model can reveal key information about the system being modeled, such as which inputs and transformations of the inputs are most useful to the ML model for learning the signal in the data, and are therefore the most predictive.

There are a variety of ML problem types. Supervised learning describes the case where the collected data contains an output value to be modeled and a set of inputs with which to train the model. EvalML focuses on training supervised learning models.

EvalML supports three common supervised ML problem types. The first is regression, where the target value to model is a continuous numeric value. The other two are binary and multiclass classification, where the target value to model consists of two discrete values (binary) or three or more discrete values or categories (multiclass). The choice of which supervised ML problem type is most appropriate depends on domain expertise and on how the model will be evaluated and used.

EvalML is currently building support for supervised time series problems: time series regression, time series binary classification, and time series multiclass classification. While we’ve added some features to tackle these kinds of problems, this functionality is still under active development, so keep that in mind before relying on it.

AutoML and Search

AutoML is the process of automating the construction, training and evaluation of ML models. Given data and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the dataset. During the search, AutoML will explore different combinations of model type, model parameters and model architecture.

An effective AutoML solution offers several advantages over constructing and tuning ML models by hand. AutoML can assist with many of the difficult aspects of ML, such as avoiding overfitting and underfitting, handling imbalanced data, detecting data leakage and other potential issues with the problem setup, and automatically applying best-practice data cleaning, feature engineering, feature selection and various modeling techniques. AutoML can also leverage search algorithms to sweep the hyperparameter search space efficiently, resulting in model performance that would be difficult to achieve by manual training.

AutoML in EvalML

EvalML supports all of the above and more.

In its simplest usage, the AutoML search interface requires only the input data, the target data and a problem_type specifying what kind of supervised ML problem to model.

** Graphing methods, like verbose AutoMLSearch, on Jupyter Notebook and Jupyter Lab require ipywidgets to be installed.

** If graphing on Jupyter Lab, jupyterlab-plotly is required. To install it, make sure you have npm installed.

[1]:
import evalml
from evalml.utils import infer_feature_types

X, y = evalml.demos.load_fraud(n_rows=650)
             Number of Features
Boolean                       1
Categorical                   6
Numeric                       5

Number of training examples: 650
Targets
False    86.31%
True     13.69%
Name: fraud, dtype: object

To provide data to EvalML, it is recommended that you initialize a Woodwork accessor on your data. This allows you to easily control how EvalML will treat each of your features before training a model.

EvalML also accepts pandas input, and will run type inference on top of the input pandas data. If you’d like to change the types inferred by EvalML, you can use the infer_feature_types utility method, which takes pandas or numpy input and converts it to a Woodwork data structure. The feature_types parameter can be used to specify what types specific columns should be.

Feature types such as Natural Language must be specified in this way. Otherwise, Woodwork will infer them as Unknown type, and they will be dropped during the AutoMLSearch.

In the example below, we reformat a couple of features to make them easily consumable by the model, and then specify that the provider column, which would otherwise have been inferred as a natural language column, is a categorical column.

[2]:
# Rewrite each "A/B" string as "20B-01-A" so the column can be parsed as a datetime.
X.ww["expiration_date"] = X["expiration_date"].apply(
    lambda x: "20{}-01-{}".format(x.split("/")[1], x.split("/")[0])
)
X = infer_feature_types(
    X,
    feature_types={
        "store_id": "categorical",
        "expiration_date": "datetime",
        "lat": "categorical",
        "lng": "categorical",
        "provider": "categorical",
    },
)

In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.

[3]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2
)
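
As a rough illustration of what setting aside a holdout set means, the following sketch shuffles row indices and reserves a test_size fraction of them. It is a minimal stand-in using only the standard library, not EvalML's split_data implementation; the holdout_split helper and its seed parameter are hypothetical names introduced here.

```python
import random


def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices and reserve a test_size fraction as a holdout set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_holdout = round(len(rows) * test_size)
    # Sort each half so the original row order is kept within it.
    holdout_idx = sorted(indices[:n_holdout])
    train_idx = sorted(indices[n_holdout:])
    train = [rows[i] for i in train_idx]
    holdout = [rows[i] for i in holdout_idx]
    return train, holdout


# 650 rows with test_size=0.2 leaves 520 rows for training and 130 for holdout.
rows = list(range(650))
train, holdout = holdout_split(rows, test_size=0.2)
print(len(train), len(holdout))  # 520 130
```

With the 650-row sample loaded earlier, a 20% holdout leaves 520 training rows, which is the dataset that the search then works with.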

Data Checks

Before calling AutoMLSearch.search, we should run some sanity checks on the input data to catch common issues before starting a potentially time-consuming search. EvalML has various data checks that make this easy. Each data check returns a collection of warnings and errors if it detects potential issues with the input data. This allows users to inspect their data and avoid confusing errors that may arise during the search process. You can learn about each of the available data checks in our data checks guide.

Here, we will run the DefaultDataChecks class, which contains a series of data checks that are generally useful.

[4]:
from evalml.data_checks import DefaultDataChecks

data_checks = DefaultDataChecks("binary", "log loss binary")
data_checks.validate(X_train, y_train)
[4]:
[]

Since there were no warnings or errors returned, we can safely continue with the search process.
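
One way to act on the result of validate is sketched below. It assumes each returned item is a dictionary with a "level" key set to "warning" or "error"; that structure is an assumption here and may differ between EvalML versions. The helper names and the sample messages are made up for illustration.

```python
def count_by_level(messages):
    """Count data check messages per level ("warning" or "error")."""
    counts = {"warning": 0, "error": 0}
    for message in messages:
        level = message.get("level")
        if level in counts:
            counts[level] += 1
    return counts


def safe_to_search(messages):
    """Treat the data as ready only when no message is an error."""
    return count_by_level(messages)["error"] == 0


# An empty result, as in the cell above, means nothing was flagged.
print(safe_to_search([]))  # True

# Hypothetical messages, only to show the shape this sketch assumes.
example_messages = [
    {"level": "warning", "message": "Column 'a' is highly null"},
    {"level": "error", "message": "Target has only one unique value"},
]
print(count_by_level(example_messages))  # {'warning': 1, 'error': 1}
print(safe_to_search(example_messages))  # False
```

Treating warnings as non-blocking is a choice made in this sketch; a stricter workflow could stop on any message at all.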

Holdout Set for Pipeline Ranking

If the holdout_set_size parameter is set and the input dataset has more than 500 rows, AutoMLSearch will create a holdout set by setting aside a holdout_set_size portion of the training data. Alternatively, a holdout set can be manually specified by using the X_holdout and y_holdout parameters in AutoMLSearch(). In this example, the holdout set created previously will be used by the AutoML search.

During the AutoML search process, the mean of the objective scores across all cross-validation folds is calculated (shown in the “mean_cv_score” column of the pipeline rankings). This score is passed to the AutoML search tuner to further optimize the hyperparameters of the next batch of pipelines.
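
The mean over folds can be sketched in a few lines. The example also computes the coefficient of variation (standard deviation divided by the mean), since the search output further down warns when it reaches 0.5. This is a minimal sketch: the function names and fold scores are hypothetical, and the use of the sample standard deviation is an assumption rather than a statement about EvalML's internals.

```python
from statistics import mean, stdev


def mean_cv_score(fold_scores):
    """Average the objective scores obtained on each cross-validation fold."""
    return mean(fold_scores)


def has_high_variation(fold_scores, threshold=0.5):
    """Flag fold scores whose coefficient of variation reaches the threshold."""
    average = mean(fold_scores)
    if average == 0:
        return False  # the ratio is undefined for a zero mean
    return stdev(fold_scores) / abs(average) >= threshold


# Hypothetical fold scores for two pipelines.
stable = [0.25, 0.26, 0.27]
unstable = [0.5, 1.5, 2.5]

print(round(mean_cv_score(stable), 3))  # 0.26
print(has_high_variation(stable))  # False
print(has_high_variation(unstable))  # True
```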

Afterwards, the pipeline is fitted on the entire training dataset and scored on the holdout set. This score appears in the “ranking_score” column of the pipeline rankings and is used to rank pipeline performance.

If a dataset has fewer than 500 rows or holdout_set_size=0 (which is the default setting), the “mean_cv_score” will be used as the “ranking_score” instead.
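
The rule described in this section can be summarized as a small decision function. This is a sketch of the logic as stated in the text, not EvalML's code: the function names are hypothetical, a dataset of exactly 500 rows is treated as too small (the text does not say which way that case goes), and a manually supplied holdout set is assumed to always be used.

```python
def uses_holdout(n_rows, holdout_set_size=0.0, holdout_provided=False):
    """Decide whether pipelines are ranked on a holdout set."""
    if holdout_provided:
        return True  # X_holdout and y_holdout were passed in explicitly
    return holdout_set_size > 0 and n_rows > 500


def ranking_score(mean_cv_score, holdout_score=None):
    """Use the holdout score when there is one, else the mean CV score."""
    return mean_cv_score if holdout_score is None else holdout_score


# A manually supplied holdout set, as in this example.
print(uses_holdout(520, holdout_provided=True))  # True

# No holdout set: default holdout_set_size of 0, or too few rows.
print(uses_holdout(650, holdout_set_size=0.0))  # False
print(uses_holdout(400, holdout_set_size=0.1))  # False

# With a holdout score available, it becomes the ranking score.
print(ranking_score(0.253, holdout_score=0.151))  # 0.151
print(ranking_score(0.253))  # 0.253
```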

[5]:
automl = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    X_holdout=X_holdout,
    y_holdout=y_holdout,
    problem_type="binary",
    verbose=True,
)
automl.search(interactive_plot=False)
AutoMLSearch will use the holdout set to score and rank pipelines.
Removing columns ['currency'] because they are of 'Unknown' type
Using default limit of max_batches=3.


*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 3 batches for a total of None pipelines.
Allowed model families:

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 4.921
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 4.991

*****************************
* Evaluating Batch Number 1 *
*****************************

Logistic Regression Classifier w/ Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.378
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.392
Random Forest Classifier w/ Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.259
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.219

*****************************
* Evaluating Batch Number 2 *
*****************************

Logistic Regression Classifier w/ Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler + RF Classifier Select From Model:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.376
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.340
Random Forest Classifier w/ Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.250
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.203

*****************************
* Evaluating Batch Number 3 *
*****************************

Decision Tree Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + One Hot Encoder + Oversampler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 1.449
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 1.532
        High coefficient of variation (cv >= 0.5) within cross validation scores.
        Decision Tree Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + One Hot Encoder + Oversampler may not perform as estimated on unseen data.
LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + One Hot Encoder + Oversampler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.319
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.183
Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + One Hot Encoder + Oversampler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.362
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.352
Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.375
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.400
CatBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + Oversampler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.575
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.552
XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + One Hot Encoder + Oversampler:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.253
        Starting holdout set scoring
        Finished holdout set scoring - Log Loss Binary: 0.151

Search finished after 00:50
Best pipeline: XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Replace Nullable Types Transformer + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Replace Nullable Types Transformer + Imputer + One Hot Encoder + Oversampler
Best pipeline Log Loss Binary: 0.150534
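
The scores in the output above are binary log loss values, where lower is better. As a reference for reading them, here is a minimal standard-library sketch of the usual binary log loss formula, the mean of -[y log(p) + (1 - y) log(1 - p)] over all rows. The function name, the eps clipping value and the sample inputs are assumptions for illustration; EvalML's own objective may handle edge cases differently.

```python
import math


def log_loss_binary(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for fully confident predictions.
        p = min(max(prob, eps), 1 - eps)
        total += -math.log(p) if label else -math.log(1 - p)
    return total / len(y_true)


labels = [0, 0, 1, 1]
confident = [0.1, 0.2, 0.8, 0.9]  # mostly right and fairly sure
uninformed = [0.5, 0.5, 0.5, 0.5]  # always predicts 50%

print(round(log_loss_binary(labels, confident), 3))  # 0.164
print(round(log_loss_binary(labels, uninformed), 3))  # 0.693
```

Because the penalty grows without bound as a confident prediction turns out wrong, a model that assigns near-zero probability to the true class on even a minority of rows can end up with a score well above 1.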