API Reference

Demo Datasets

load_breast_cancer

Load breast cancer dataset. Binary classification problem.

load_diabetes

Load diabetes dataset. Regression problem.

load_fraud

Load credit card fraud dataset.

load_wine

Load wine dataset. Multiclass problem.

load_churn

Load churn dataset.

Preprocessing

Utilities to preprocess data before using evalml.

load_data

Load features and target from file.

drop_nan_target_rows

Drops rows in X and y when the corresponding value in the target y is NaN.
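
The behavior described above can be illustrated with a small standard-library sketch. The helper name `drop_nan_target_rows_sketch` and the use of plain lists are assumptions made for illustration; the library function operates on its own data structures and may differ in detail.

```python
import math


def drop_nan_target_rows_sketch(X, y):
    """Keep only the (row, target) pairs whose target is neither None nor NaN."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    kept = [(row, target) for row, target in zip(X, y) if not is_missing(target)]
    X_kept = [row for row, _ in kept]
    y_kept = [target for _, target in kept]
    return X_kept, y_kept


# Hypothetical toy data: the second target value is missing.
X = [[1, 2], [3, 4], [5, 6]]
y = [0.0, float("nan"), 1.0]
X_clean, y_clean = drop_nan_target_rows_sketch(X, y)
```

Rows are removed from both X and y together so that features and targets stay aligned.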

target_distribution

Get the target distributions.

number_of_features

Get the number of features of each specific dtype in a DataFrame.

split_data

Splits data into train and test sets.

Exceptions

MethodPropertyNotFoundError

Exception to raise when a class does not have an expected method or property.

PipelineNotFoundError

An exception raised when a particular pipeline is not found in automl search results.

ObjectiveNotFoundError

Exception to raise when specified objective does not exist.

MissingComponentError

An exception raised when a component is not found in all_components().

ComponentNotYetFittedError

An exception to be raised when predict/predict_proba/transform is called on a component without fitting first.

PipelineNotYetFittedError

An exception to be raised when predict/predict_proba/transform is called on a pipeline without fitting first.

AutoMLSearchException

Exception raised when all pipelines in an automl batch return a score of NaN for the primary objective.

EnsembleMissingPipelinesError

An exception raised when an ensemble is missing estimators (list) as a parameter.

PipelineScoreError

An exception raised when a pipeline errors while scoring any objective in a list of objectives.

DataCheckInitError

Exception raised when a data check can’t initialize with the parameters given.

NullsInColumnWarning

Warning thrown when there are null values in the column of interest.

AutoML

AutoML Search Interface

AutoMLSearch

Automated Pipeline search.

AutoML Utils

search

Given data and configuration, run an automl search.

get_default_primary_search_objective

Get the default primary search objective for a problem type.

make_data_splitter

Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.

AutoML Algorithm Classes

AutoMLAlgorithm

Base class for the AutoML algorithms which power EvalML.

IterativeAlgorithm

An automl algorithm which first fits a base round of pipelines with default parameters, then does a round of parameter tuning on each pipeline in order of performance.

AutoML Callbacks

silent_error_callback

No-op.

log_error_callback

Logs the exception thrown as an error. Will not throw. This is the default behavior for AutoMLSearch.

raise_error_callback

Raises the exception thrown by the AutoMLSearch object. Also logs the exception as an error.

AutoML Engines

SequentialEngine

The default engine for the AutoML search. Trains and scores pipelines locally and sequentially.

CFEngine

The concurrent.futures (CF) engine.

DaskEngine

The Dask engine.

Pipelines

Pipeline Base Classes

PipelineBase

Machine learning pipeline made out of transformers and an Estimator.

ClassificationPipeline

Pipeline subclass for all classification pipelines.

BinaryClassificationPipeline

Pipeline subclass for all binary classification pipelines.

MulticlassClassificationPipeline

Pipeline subclass for all multiclass classification pipelines.

RegressionPipeline

Pipeline subclass for all regression pipelines.

TimeSeriesClassificationPipeline

Pipeline base class for time series classification problems.

TimeSeriesBinaryClassificationPipeline

Pipeline base class for time series binary classification problems.

TimeSeriesMulticlassClassificationPipeline

Pipeline base class for time series multiclass classification problems.

TimeSeriesRegressionPipeline

Pipeline base class for time series regression problems.

Pipeline Utils

make_pipeline

Given input data, target data, an estimator class, and the problem type, generates a pipeline.

generate_pipeline_code

Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.

Components

Component Base Classes

Components represent a step in a pipeline.

ComponentBase

Base class for all components.

Transformer

A component that may or may not need fitting that transforms data.

Estimator

A component that fits and predicts given data.

Component Utils

allowed_model_families

List the model types allowed for a particular problem type.

get_estimators

Returns the estimators allowed for a particular problem type.

generate_component_code

Creates and returns a string that contains the Python imports and code required for running the EvalML component.

Transformers

Transformers are components that take in data as input and output transformed data.

DropColumns

Drops specified columns in input data.

SelectColumns

Selects specified columns in input data.

SelectByType

Selects columns by specified Woodwork logical type or semantic tag in input data.

OneHotEncoder

A transformer that encodes categorical features in a one-hot numeric array.

TargetEncoder

A transformer that encodes categorical features into target encodings.

PerColumnImputer

Imputes missing data according to a specified imputation strategy per column.

Imputer

Imputes missing data according to a specified imputation strategy.

SimpleImputer

Imputes missing data according to a specified imputation strategy.

StandardScaler

A transformer that standardizes input features by removing the mean and scaling to unit variance.
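
As a rough illustration of what "removing the mean and scaling to unit variance" means for a single numeric column, here is a minimal sketch. The function name `standardize` is hypothetical, and the use of the population standard deviation follows scikit-learn's StandardScaler convention; that this component behaves identically is an assumption.

```python
import statistics


def standardize(column):
    """Return (v - mean) / std for each value, using the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column has no spread; map every value to zero.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


scaled = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```

After scaling, the column has mean 0 and population standard deviation 1.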

RFRegressorSelectFromModel

Selects top features based on importance weights using a Random Forest regressor.

RFClassifierSelectFromModel

Selects top features based on importance weights using a Random Forest classifier.

DropNullColumns

Transformer to drop features whose percentage of NaN values exceeds a specified threshold.

DateTimeFeaturizer

Transformer that can automatically extract features from datetime columns.

TextFeaturizer

Transformer that can automatically featurize text columns using featuretools’ nlp_primitives.

DelayedFeatureTransformer

Transformer that delays input features and target variable for time series problems.

DFSTransformer

Featuretools DFS component that generates features for the input features.

PolynomialDetrender

Removes trends from time series by fitting a polynomial to the data.

Undersampler

Initializes an undersampling transformer to downsample the majority classes in the dataset.

Oversampler

SMOTE Oversampler component. Will automatically select whether to use SMOTE, SMOTEN, or SMOTENC based on inputs to the component.

Estimators

Classifiers

Classifiers are components that output a predicted class label.

CatBoostClassifier

CatBoost Classifier, a classifier that uses gradient-boosting on decision trees.

ElasticNetClassifier

Elastic Net Classifier. Uses Logistic Regression with elasticnet penalty as the base estimator.

ExtraTreesClassifier

Extra Trees Classifier.

RandomForestClassifier

Random Forest Classifier.

LightGBMClassifier

LightGBM Classifier.

LogisticRegressionClassifier

Logistic Regression Classifier.

XGBoostClassifier

XGBoost Classifier.

BaselineClassifier

Classifier that predicts using the specified strategy.

SklearnStackedEnsembleClassifier

Scikit-learn Stacked Ensemble Classifier.

DecisionTreeClassifier

Decision Tree Classifier.

KNeighborsClassifier

K-Nearest Neighbors Classifier.

SVMClassifier

Support Vector Machine Classifier.

Regressors

Regressors are components that output a predicted target value.

ARIMARegressor

Autoregressive Integrated Moving Average Model.

CatBoostRegressor

CatBoost Regressor, a regressor that uses gradient-boosting on decision trees.

ElasticNetRegressor

Elastic Net Regressor.

LinearRegressor

Linear Regressor.

ExtraTreesRegressor

Extra Trees Regressor.

RandomForestRegressor

Random Forest Regressor.

XGBoostRegressor

XGBoost Regressor.

BaselineRegressor

Baseline regressor that uses a simple strategy to make predictions.

TimeSeriesBaselineEstimator

Time series estimator that predicts using the naive forecasting approach.

SklearnStackedEnsembleRegressor

Scikit-learn Stacked Ensemble Regressor.

DecisionTreeRegressor

Decision Tree Regressor.

LightGBMRegressor

LightGBM Regressor.

SVMRegressor

Support Vector Machine Regressor.

Model Understanding

Utility Methods

confusion_matrix

Confusion matrix for binary and multiclass classification.

normalize_confusion_matrix

Normalizes a confusion matrix.
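
One common way to normalize a confusion matrix is to divide each row by its sum, so that every row shows the fraction of each true class assigned to each predicted class. The sketch below assumes that row-wise convention and uses a hypothetical helper name; the library function may support other normalization modes.

```python
def normalize_rows(conf_mat):
    """Divide each row of a confusion matrix by that row's total."""
    normalized = []
    for row in conf_mat:
        total = sum(row)
        if total == 0:
            raise ValueError("Cannot normalize a row that sums to zero.")
        normalized.append([count / total for count in row])
    return normalized


# Rows are true classes, columns are predicted classes (toy counts).
normalized = normalize_rows([[8, 2], [1, 9]])
```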

precision_recall_curve

Given labels and binary classifier predicted probabilities, compute and return the data representing a precision-recall curve.

roc_curve

Given labels and classifier predicted probabilities, compute and return the data representing a Receiver Operating Characteristic (ROC) curve. Works with binary or multiclass problems.

calculate_permutation_importance

Calculates permutation importance for features.

calculate_permutation_importance_one_column

Calculates permutation importance for one column in the original dataframe.

binary_objective_vs_threshold

Computes objective score as a function of potential binary classification decision thresholds for a fitted binary classification pipeline.

get_prediction_vs_actual_over_time_data

Get the data needed for the prediction_vs_actual_over_time plot.

partial_dependence

Calculates one-way or two-way partial dependence.

get_prediction_vs_actual_data

Combines y_true and y_pred into a single dataframe and adds a column for outliers. Used in graph_prediction_vs_actual().

get_linear_coefficients

Returns a dataframe showing the features with the greatest predictive power for a linear model.

t_sne

Get the transformed output after fitting X to the embedded space using t-SNE.

Graph Utility Methods

graph_precision_recall_curve

Generate and display a precision-recall plot.

graph_roc_curve

Generate and display a Receiver Operating Characteristic (ROC) plot for binary and multiclass classification problems.

graph_confusion_matrix

Generate and display a confusion matrix plot.

graph_permutation_importance

Generate a bar graph of the pipeline’s permutation importance.

graph_binary_objective_vs_threshold

Generates a plot graphing objective score vs. decision thresholds for a fitted binary classification pipeline.

graph_prediction_vs_actual

Generate a scatter plot comparing the true and predicted values. Used for regression plotting.

graph_prediction_vs_actual_over_time

Plot the target values and predictions against time on the x-axis.

graph_partial_dependence

Create a one-way or two-way partial dependence plot.

graph_t_sne

Plot high-dimensional data in a lower-dimensional space using t-SNE.

Prediction Explanations

explain_predictions

Creates a report summarizing the top contributing features for each data point in the input features.

explain_predictions_best_worst

Creates a report summarizing the top contributing features for the best and worst points in the dataset as measured by error to true labels.

Objectives

Objective Base Classes

ObjectiveBase

Base class for all objectives.

BinaryClassificationObjective

Base class for all binary classification objectives.

MulticlassClassificationObjective

Base class for all multiclass classification objectives.

RegressionObjective

Base class for all regression objectives.

Domain-Specific Objectives

FraudCost

Score the percentage of the total transaction amount processed that is lost due to fraud.

LeadScoring

Lead scoring.

CostBenefitMatrix

Score using a cost-benefit matrix. Scores quantify the benefits of a given value, so a greater numeric score is better.

Classification Objectives

AccuracyBinary

Accuracy score for binary classification.

AccuracyMulticlass

Accuracy score for multiclass classification.

AUC

AUC score for binary classification.

AUCMacro

AUC score for multiclass classification using macro averaging.

AUCMicro

AUC score for multiclass classification using micro averaging.

AUCWeighted

AUC score for multiclass classification using weighted averaging.

Gini

Gini coefficient for binary classification.
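
In binary classification, the normalized Gini coefficient is commonly defined from the AUC as Gini = 2 * AUC - 1. The sketch below computes AUC by comparing every positive/negative score pair and then applies that relation. The function names are hypothetical, and whether this objective is computed in exactly this way is an assumption.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(y_true, y_score):
    """Normalized Gini coefficient derived from AUC."""
    return 2.0 * auc_pairwise(y_true, y_score) - 1.0


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_pairwise(labels, scores)
gini_value = gini(labels, scores)
```

Under this definition a random classifier (AUC 0.5) has a Gini of 0 and a perfect classifier has a Gini of 1.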

BalancedAccuracyBinary

Balanced accuracy score for binary classification.

BalancedAccuracyMulticlass

Balanced accuracy score for multiclass classification.

F1

F1 score for binary classification.

F1Micro

F1 score for multiclass classification using micro averaging.

F1Macro

F1 score for multiclass classification using macro averaging.

F1Weighted

F1 score for multiclass classification using weighted averaging.

LogLossBinary

Log Loss for binary classification.

LogLossMulticlass

Log Loss for multiclass classification.

MCCBinary

Matthews correlation coefficient for binary classification.

MCCMulticlass

Matthews correlation coefficient for multiclass classification.

Precision

Precision score for binary classification.

PrecisionMicro

Precision score for multiclass classification using micro averaging.

PrecisionMacro

Precision score for multiclass classification using macro averaging.

PrecisionWeighted

Precision score for multiclass classification using weighted averaging.

Recall

Recall score for binary classification.

RecallMicro

Recall score for multiclass classification using micro averaging.

RecallMacro

Recall score for multiclass classification using macro averaging.

RecallWeighted

Recall score for multiclass classification using weighted averaging.

Regression Objectives

R2

Coefficient of determination for regression.

MAE

Mean absolute error for regression.

MAPE

Mean absolute percentage error for time series regression. Scaled by 100 to return a percentage.
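
The "scaled by 100" wording can be made concrete with the usual MAPE formula: the mean of |actual - predicted| / |actual|, multiplied by 100. The sketch below is a standard-library illustration with a hypothetical function name, not the library's implementation. It rejects zero targets because the ratio is undefined there.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a percentage."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("Inputs must be non-empty and of equal length.")
    if any(actual == 0 for actual in y_true):
        raise ValueError("MAPE is undefined when a true value is zero.")
    total = sum(abs((actual - pred) / actual) for actual, pred in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Relative errors are 10%, 10%, and 0%, so the mean is about 6.67%.
score = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```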

MSE

Mean squared error for regression.

MeanSquaredLogError

Mean squared log error for regression.

MedianAE

Median absolute error for regression.

MaxError

Maximum residual error for regression.

ExpVariance

Explained variance score for regression.

RootMeanSquaredError

Root mean squared error for regression.

RootMeanSquaredLogError

Root mean squared log error for regression.

Objective Utils

get_all_objective_names

Get a list of the names of all objectives.

get_core_objectives

Returns all core objective instances associated with the given problem type.

get_core_objective_names

Get a list of all valid core objectives.

get_non_core_objectives

Get non-core objective classes.

get_objective

Returns the Objective class corresponding to a given objective name.

Problem Types

handle_problem_types

Handles problem_type by either returning the ProblemTypes or converting from a str.

detect_problem_type

Determine the type of problem being solved based on the targets (binary vs. multiclass classification, or regression).

ProblemTypes

Enum defining the supported types of machine learning problems.

Model Family

handle_model_family

Handles model_family by either returning the ModelFamily or converting from a string.

ModelFamily

Enum for family of machine learning models.

Tuners

Tuner

Defines API for base Tuner classes.

SKOptTuner

Bayesian Optimizer.

GridSearchTuner

Grid Search Optimizer, which generates all of the possible points to search for using a grid.
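
Generating "all of the possible points" of a grid amounts to taking the Cartesian product of the candidate values for each parameter. The sketch below shows that idea with `itertools.product`. The parameter names and the dictionary-based search space are hypothetical, and how the tuner represents its search space or discretizes continuous ranges is not shown here.

```python
import itertools


def grid_points(search_space):
    """Yield every combination of parameter values as a dict."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    for combination in itertools.product(*value_lists):
        yield dict(zip(names, combination))


# Hypothetical search space: 2 x 3 = 6 grid points.
space = {"max_depth": [2, 4], "learning_rate": [0.1, 0.01, 0.001]}
points = list(grid_points(space))
```

The number of points grows multiplicatively with each added parameter, which is the main cost of an exhaustive grid.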

RandomSearchTuner

Random Search Optimizer.

Data Checks

Data Check Classes

DataCheck

Base class for all data checks. Data checks are a set of heuristics used to determine if there are problems with input data.

InvalidTargetDataCheck

Checks if the target data contains missing or invalid values.

HighlyNullDataCheck

Checks if there are any highly-null columns and rows in the input.

IDColumnsDataCheck

Check if any of the features are likely to be ID columns.

TargetLeakageDataCheck

Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.

OutliersDataCheck

Checks if there are any outliers in input data by using IQR to determine score anomalies. Columns with score anomalies are considered to contain outliers.
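
A typical IQR-based rule flags values that fall more than some multiple of the interquartile range below the first quartile or above the third quartile. The sketch below uses the conventional multiplier of 1.5, which is an assumption here, as is the quartile method. The helper name is hypothetical and the data check's exact thresholds may differ.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Toy column with one value far from the rest.
outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
```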

NoVarianceDataCheck

Check if the target or any of the features have no variance.

ClassImbalanceDataCheck

Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.

MulticollinearityDataCheck

Check if any set of features is likely to be multicollinear.

DateTimeNaNDataCheck

Checks each column in the input for datetime features and will issue an error if NaN values are present.

NaturalLanguageNaNDataCheck

Checks each column in the input for natural language features and will issue an error if NaN values are present.

DateTimeFormatDataCheck

Checks if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order.

DataChecks

A collection of data checks.

DefaultDataChecks

A collection of basic data checks that is used by AutoML by default.

Data Check Messages

DataCheckMessage

Base class for a message returned by a DataCheck, tagged by name.

DataCheckError

DataCheckMessage subclass for errors returned by data checks.

DataCheckWarning

DataCheckMessage subclass for warnings returned by data checks.

Data Check Message Types

DataCheckMessageType

Enum for type of data check message: WARNING or ERROR.

Data Check Message Codes

DataCheckMessageCode

Enum for data check message code.

Utils

General Utils

import_or_raise

Attempts to import the requested library by name.

convert_to_seconds

Converts a string describing a length of time to its length in seconds.
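
To show the kind of conversion this describes, here is a minimal sketch that parses strings of the form "<number> <unit>". The helper name, the accepted input format, and the unit table are all assumptions for illustration; the strings the library function actually accepts may differ.

```python
# Hypothetical unit table; the real function may accept other spellings.
SECONDS_PER_UNIT = {
    "s": 1, "sec": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hr": 3600, "hour": 3600, "hours": 3600,
}


def to_seconds(text):
    """Convert a string such as '2 minutes' to a number of seconds."""
    parts = text.strip().lower().split()
    if len(parts) != 2:
        raise ValueError(f"Expected '<number> <unit>', got {text!r}")
    amount, unit = parts
    if unit not in SECONDS_PER_UNIT:
        raise ValueError(f"Unknown time unit: {unit!r}")
    return float(amount) * SECONDS_PER_UNIT[unit]


two_minutes = to_seconds("2 minutes")
ninety_minutes = to_seconds("1.5 hours")
```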

get_random_state

Generates a numpy.random.RandomState instance using seed.

get_random_seed

Given a numpy.random.RandomState object, generate an int representing a seed value for another random number generator. Or, if given an int, return that int.

pad_with_nans

Pad the beginning num_to_pad rows with nans.
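
The padding described above can be sketched on a plain list: prepend `num_to_pad` NaN values so that the original data is shifted down by that many rows. The helper name and the list-based input are assumptions; the library function operates on its own data structures.

```python
import math


def pad_with_nans_sketch(values, num_to_pad):
    """Return a new list with num_to_pad NaNs placed before the values."""
    if num_to_pad < 0:
        raise ValueError("num_to_pad must be non-negative.")
    return [float("nan")] * num_to_pad + list(values)


padded = pad_with_nans_sketch([1.0, 2.0, 3.0], 2)
```

The result is longer than the input by `num_to_pad` entries, and the original values keep their relative order.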

drop_rows_with_nans

Drop rows that have any NaNs in all dataframes or series.

infer_feature_types

Create a Woodwork structure from the given list, pandas, or numpy input, with specified types for columns.

save_plot

Saves fig to filepath if specified, or to a default location if not.

is_all_numeric

Checks if the given DataFrame contains only numeric values.

get_importable_subclasses

Get importable subclasses of a base class.