API Reference¶
Demo Datasets¶
Load breast cancer dataset. Binary classification problem. |
|
Load diabetes dataset. Used for regression problem. |
|
Load credit card fraud dataset. |
|
Load wine dataset. Multiclass problem. |
|
Load churn dataset, which can be used for binary classification problems. |
Preprocessing¶
Utilities to preprocess data before using evalml.
Load features and target from file. |
|
Get the target distributions. |
|
Get the number of features of each specific dtype in a DataFrame. |
|
Split data into train and test sets. |
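As a rough illustration of the train/test split described above, the sketch below shuffles row indices with a seeded generator and slices them into two parts, using only the standard library. The name `split_rows` and its `test_size` and `seed` parameters are assumptions made for this example, not EvalML's actual signature.

```python
import random


def split_rows(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then split X and y into train and test parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(rows, idx):
        return [rows[i] for i in idx]

    return take(X, train_idx), take(X, test_idx), take(y, train_idx), take(y, test_idx)


X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = split_rows(X, y, test_size=0.2, seed=42)
```

Because the same index lists are applied to both `X` and `y`, each feature row stays paired with its target value.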
Exceptions¶
Exception to raise when a class does not have an expected method or property. |
|
An exception raised when a particular pipeline is not found in automl search results. |
|
Exception to raise when specified objective does not exist. |
|
An exception raised when a component is not found in all_components(). |
|
An exception to be raised when predict/predict_proba/transform is called on a component without fitting first. |
|
An exception to be raised when predict/predict_proba/transform is called on a pipeline without fitting first. |
|
Exception raised when all pipelines in an automl batch return a score of NaN for the primary objective. |
|
An exception raised when an ensemble is missing estimators (list) as a parameter. |
|
An exception raised when a pipeline errors while scoring any objective in a list of objectives. |
|
Exception raised when a data check can’t initialize with the parameters given. |
|
Warning thrown when there are null values in the column of interest. |
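The "not fitted" exceptions listed above guard against calling predict or transform too early. A minimal sketch of that pattern follows; `NotFittedError` and `MeanPredictor` are hypothetical names invented for this example and are not EvalML classes.

```python
class NotFittedError(Exception):
    """Hypothetical stand-in for a 'component not yet fitted' exception."""


class MeanPredictor:
    """Toy component that predicts the mean of the training target."""

    def __init__(self):
        self._mean = None  # set by fit(); None means "not fitted yet"

    def fit(self, y):
        self._mean = sum(y) / len(y)
        return self

    def predict(self, n_rows):
        if self._mean is None:
            raise NotFittedError("call fit() before predict()")
        return [self._mean] * n_rows


predictor = MeanPredictor()
try:
    predictor.predict(2)
    raised = False
except NotFittedError:
    raised = True

fitted_predictions = predictor.fit([1.0, 2.0, 3.0]).predict(2)
```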
AutoML¶
AutoML Search Interface¶
Automated Pipeline search. |
AutoML Utils¶
Given data and configuration, run an automl search. |
|
Get the default primary search objective for a problem type. |
|
Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search. |
AutoML Algorithm Classes¶
Base class for the AutoML algorithms which power EvalML. |
|
An automl algorithm which first fits a base round of pipelines with default parameters, then does a round of parameter tuning on each pipeline in order of performance. |
AutoML Callbacks¶
No-op. |
|
Logs the exception thrown as an error. |
|
Raises the exception thrown by the AutoMLSearch object. |
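The three callback behaviours above (ignore, log, raise) can be sketched as plain functions passed to a search loop. Everything here is hypothetical: the function names, the `run_batch` helper, and the choice to record a failed pipeline as NaN are assumptions for illustration, not EvalML's interface.

```python
import logging
import math

logger = logging.getLogger("automl_sketch")


def silent_callback(exception, **kwargs):
    """No-op: ignore the error and let the search continue."""


def log_callback(exception, **kwargs):
    """Log the error, then let the search continue."""
    logger.error("Pipeline %s failed: %s", kwargs.get("pipeline_name"), exception)


def raise_callback(exception, **kwargs):
    """Re-raise the error, which stops the search."""
    raise exception


def run_batch(pipelines, error_callback):
    """Call each zero-argument scoring function; hand failures to the callback."""
    results = {}
    for name, score_fn in pipelines.items():
        try:
            results[name] = score_fn()
        except Exception as exc:
            error_callback(exc, pipeline_name=name)
            results[name] = math.nan  # reached only if the callback did not raise
    return results


def good_pipeline():
    return 0.9


def bad_pipeline():
    raise ValueError("boom")


scores = run_batch({"good": good_pipeline, "bad": bad_pipeline}, silent_callback)
```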
AutoML Engines¶
The default engine for the AutoML search. |
|
The concurrent.futures (CF) engine. |
|
The dask engine. |
Pipelines¶
Pipeline Base Classes¶
Machine learning pipeline. |
|
Pipeline subclass for all classification pipelines. |
|
Pipeline subclass for all binary classification pipelines. |
|
Pipeline subclass for all multiclass classification pipelines. |
|
Pipeline subclass for all regression pipelines. |
|
Pipeline base class for time series classification problems. |
|
Pipeline base class for time series binary classification problems. |
|
Pipeline base class for time series multiclass classification problems. |
|
Pipeline base class for time series regression problems. |
Pipeline Utils¶
Given input data, target data, an estimator class and the problem type, generates a pipeline class with a preprocessing chain which was recommended based on the inputs. The pipeline will be a subclass of the appropriate pipeline base class for the specified problem_type. |
|
Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline. |
Component Graphs¶
Component graph for a pipeline as a directed acyclic graph (DAG). |
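To make the DAG idea concrete, the sketch below stores a hypothetical component graph as a mapping from each component to the components it depends on, then derives a valid evaluation order with the standard library's `graphlib` (Python 3.9+). The component names and the dictionary layout are assumptions for this example, not EvalML's internal representation.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical graph: each key maps to the set of components feeding into it.
component_graph = {
    "Imputer": set(),
    "Datetime Featurizer": {"Imputer"},
    "One Hot Encoder": {"Imputer"},
    "Random Forest Classifier": {"Datetime Featurizer", "One Hot Encoder"},
}


def evaluation_order(graph):
    """Return the components so that every dependency precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())


order = evaluation_order(component_graph)

# A cycle means the graph is not a DAG, so no valid order exists.
try:
    evaluation_order({"A": {"B"}, "B": {"A"}})
    has_cycle = False
except CycleError:
    has_cycle = True
```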
Components¶
Component Base Classes¶
Components represent a step in a pipeline.
Base class for all components. |
|
A component that transforms data and may or may not need fitting. These components are used before an estimator. |
|
A component that fits and predicts given data. |
Component Utils¶
List the model types allowed for a particular problem type. |
|
Returns the estimators allowed for a particular problem type. |
|
Creates and returns a string that contains the Python imports and code required for running the EvalML component. |
Transformers¶
Transformers are components that take in data as input and output transformed data.
Drops specified columns in input data. |
|
Selects specified columns in input data. |
|
Selects columns by specified Woodwork logical type or semantic tag in input data. |
|
A transformer that encodes categorical features in a one-hot numeric array. |
|
A transformer that encodes categorical features into target encodings. |
|
Imputes missing data according to a specified imputation strategy per column. |
|
Imputes missing data according to a specified imputation strategy. |
|
Imputes missing data according to a specified imputation strategy. |
|
A transformer that standardizes input features by removing the mean and scaling to unit variance. |
|
Selects top features based on importance weights using a Random Forest regressor. |
|
Selects top features based on importance weights using a Random Forest classifier. |
|
Transformer to drop features whose percentage of NaN values exceeds a specified threshold. |
|
Transformer that can automatically extract features from datetime columns. |
|
Transformer that can automatically featurize text columns using featuretools’ nlp_primitives. |
|
Transformer that delays input features and target variable for time series problems. |
|
Featuretools DFS component that generates features for the input features. |
|
Removes trends from time series by fitting a polynomial to the data. |
|
Initializes an undersampling transformer to downsample the majority classes in the dataset. |
|
SMOTE Oversampler component. Will automatically select whether to use SMOTE, SMOTEN, or SMOTENC based on inputs to the component. |
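Most transformers above follow a fit-then-transform pattern. As one example, the standardizing transformer ("removing the mean and scaling to unit variance") can be sketched for a single numeric column as follows. `ColumnScaler` is a hypothetical name, and using the population standard deviation with a fallback scale of 1.0 for constant columns is an assumption of this sketch.

```python
import statistics


class ColumnScaler:
    """Toy standardizer for one numeric column."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        std = statistics.pstdev(values)  # population standard deviation
        self.scale_ = std if std > 0 else 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


column = [1.0, 2.0, 3.0]
scaled = ColumnScaler().fit(column).transform(column)
constant = ColumnScaler().fit([5.0, 5.0]).transform([5.0, 5.0])
```

The statistics are learned in `fit` and reused in `transform`, so held-out data is scaled with the training mean and standard deviation rather than its own.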
Estimators¶
Classifiers¶
Classifiers are components that output a predicted class label.
CatBoost Classifier, a classifier that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features. |
|
Elastic Net Classifier. Uses Logistic Regression with elasticnet penalty as the base estimator. |
|
Extra Trees Classifier. |
|
Random Forest Classifier. |
|
LightGBM Classifier. |
|
Logistic Regression Classifier. |
|
XGBoost Classifier. |
|
Classifier that predicts using the specified strategy. |
|
Scikit-learn Stacked Ensemble Classifier. |
|
Stacked Ensemble Classifier. |
|
Decision Tree Classifier. |
|
K-Nearest Neighbors Classifier. |
|
Support Vector Machine Classifier. |
Regressors¶
Regressors are components that output a predicted target value.
Autoregressive Integrated Moving Average Model. The three parameters (p, d, q) are the AR order, the degree of differencing, and the MA order. More information here: https://www.statsmodels.org/devel/generated/statsmodels.tsa.arima_model.ARIMA.html. |
|
CatBoost Regressor, a regressor that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features. |
|
Elastic Net Regressor. |
|
Linear Regressor. |
|
Extra Trees Regressor. |
|
Random Forest Regressor. |
|
XGBoost Regressor. |
|
Baseline regressor that uses a simple strategy to make predictions. This is useful as a simple baseline regressor to compare with other regressors. |
|
Time series estimator that predicts using the naive forecasting approach. |
|
Scikit-learn Stacked Ensemble Regressor. |
|
Stacked Ensemble Regressor. |
|
Decision Tree Regressor. |
|
LightGBM Regressor. |
|
Support Vector Machine Regressor. |
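The two baseline entries above are simple enough to sketch directly. The code below assumes "mean" and "median" as the simple strategies and treats naive forecasting as repeating the last observed value; the function names and parameters are hypothetical and are not EvalML's API.

```python
import statistics


def baseline_predict(y_train, n_rows, strategy="mean"):
    """Predict one constant value (mean or median of the training target)."""
    if strategy == "mean":
        value = statistics.mean(y_train)
    elif strategy == "median":
        value = statistics.median(y_train)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [value] * n_rows


def naive_forecast(history, horizon):
    """Forecast every future step as the last observed value."""
    if not history:
        raise ValueError("history must not be empty")
    return [history[-1]] * horizon


mean_predictions = baseline_predict([1, 2, 6], n_rows=2)
median_predictions = baseline_predict([1, 2, 6], n_rows=2, strategy="median")
forecast = naive_forecast([5, 7, 9], horizon=3)
```

Any real model should beat these predictions; if it does not, the features likely carry little signal for the target.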
Model Understanding¶
Utility Methods¶
Confusion matrix for binary and multiclass classification. |
|
Normalizes a confusion matrix. |
|
Given labels and binary classifier predicted probabilities, compute and return the data representing a precision-recall curve. |
|
Given labels and classifier predicted probabilities, compute and return the data representing a Receiver Operating Characteristic (ROC) curve. Works with binary or multiclass problems. |
|
Calculates permutation importance for features. |
|
Calculates permutation importance for one column in the original dataframe. |
|
Computes objective score as a function of potential binary classification decision thresholds for a fitted binary classification pipeline. |
|
Get the data needed for the prediction_vs_actual_over_time plot. |
|
Calculates one-way or two-way partial dependence. |
|
Combines y_true and y_pred into a single dataframe and adds a column for outliers. Used in graph_prediction_vs_actual(). |
|
Returns a dataframe showing the features with the greatest predictive power for a linear model. |
|
Get the transformed output after fitting X to the embedded space using t-SNE. |
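As an illustration of the first two utilities above, the sketch below builds a confusion matrix (rows are true labels, columns are predicted labels) and normalizes each row so it sums to 1. The function names, the explicit `labels` argument, and row-wise normalization are assumptions of this example; other normalization modes are not shown.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so entries become per-class rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


matrix = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], labels=[0, 1])
rates = normalize_rows(matrix)
```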
Graph Utility Methods¶
Generate and display a precision-recall plot. |
|
Generate and display a Receiver Operating Characteristic (ROC) plot for binary and multiclass classification problems. |
|
Generate and display a confusion matrix plot. |
|
Generate a bar graph of the pipeline’s permutation importance. |
|
Generates a plot graphing objective score vs. decision thresholds for a fitted binary classification pipeline. |
|
Generate a scatter plot comparing the true and predicted values. Used for regression plotting. |
|
Plot the target values and predictions against time on the x-axis. |
|
Create a one-way or two-way partial dependence plot. |
|
Plot high dimensional data into lower dimensional space using t-SNE. |
Prediction Explanations¶
Creates a report summarizing the top contributing features for each data point in the input features. |
|
Creates a report summarizing the top contributing features for the best and worst points in the dataset as measured by error to true labels. |
Objectives¶
Objective Base Classes¶
Base class for all objectives. |
|
Base class for all binary classification objectives. |
|
Base class for all multiclass classification objectives. |
|
Base class for all regression objectives. |
Domain-Specific Objectives¶
Score the percentage of the total transaction amount processed that is lost to fraud. |
|
Lead scoring. |
|
Score using a cost-benefit matrix. Scores quantify the benefits of a given value, so a greater numeric score represents a better score. Costs and scores can be negative, indicating that a value is not beneficial. For example, in the case of monetary profit, a negative cost and/or score represents loss of cash flow. |
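A cost-benefit score can be sketched by assigning a payoff to each of the four binary outcomes and averaging over the samples. The function name, its parameters, and the choice to average per sample are assumptions of this example; the payoff values are made up.

```python
def cost_benefit_score(y_true, y_pred, true_positive, true_negative,
                       false_positive, false_negative):
    """Average payoff per sample for binary labels encoded as 0 and 1."""
    payoff = {
        (1, 1): true_positive,
        (0, 0): true_negative,
        (0, 1): false_positive,  # predicted 1, actually 0
        (1, 0): false_negative,  # predicted 0, actually 1
    }
    total = sum(payoff[(t, p)] for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# One outcome of each kind: payoffs 10, -5, -2, and 0, averaged over 4 samples.
score = cost_benefit_score(
    y_true=[1, 1, 0, 0],
    y_pred=[1, 0, 1, 0],
    true_positive=10,
    true_negative=0,
    false_positive=-2,
    false_negative=-5,
)
```

Here the score is positive (0.75 per sample) because the gain from the one true positive outweighs the two penalized errors.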
Classification Objectives¶
Accuracy score for binary classification. |
|
Accuracy score for multiclass classification. |
|
AUC score for binary classification. |
|
AUC score for multiclass classification using macro averaging. |
|
AUC score for multiclass classification using micro averaging. |
|
AUC score for multiclass classification using weighted averaging. |
|
Gini coefficient for binary classification. |
|
Balanced accuracy score for binary classification. |
|
Balanced accuracy score for multiclass classification. |
|
F1 score for binary classification. |
|
F1 score for multiclass classification using micro averaging. |
|
F1 score for multiclass classification using macro averaging. |
|
F1 score for multiclass classification using weighted averaging. |
|
Log Loss for binary classification. |
|
Log Loss for multiclass classification. |
|
Matthews correlation coefficient for binary classification. |
|
Matthews correlation coefficient for multiclass classification. |
|
Precision score for binary classification. |
|
Precision score for multiclass classification using micro averaging. |
|
Precision score for multiclass classification using macro averaging. |
|
Precision score for multiclass classification using weighted averaging. |
|
Recall score for binary classification. |
|
Recall score for multiclass classification using micro averaging. |
|
Recall score for multiclass classification using macro averaging. |
|
Recall score for multiclass classification using weighted averaging. |
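Several objectives above differ only in how per-class results are averaged. The sketch below computes F1 under the three schemes: micro averaging pools the counts across classes, macro averaging takes the unweighted mean of per-class scores, and weighted averaging weights each class score by its number of true samples. The helper names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_averages(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = {label: {"tp": 0, "fp": 0, "fn": 0, "support": 0} for label in labels}
    for t, p in zip(y_true, y_pred):
        counts[t]["support"] += 1
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[t]["fn"] += 1  # missed a sample of class t
            counts[p]["fp"] += 1  # wrongly predicted class p
    per_class = {
        label: f1_from_counts(c["tp"], c["fp"], c["fn"]) for label, c in counts.items()
    }
    micro = f1_from_counts(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * counts[l]["support"] for l in labels) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}


averages = f1_averages(y_true=[0, 0, 0, 1, 1, 2], y_pred=[0, 0, 1, 1, 1, 2])
```

In this example the per-class F1 values are 0.8, 0.8, and 1.0, so macro averaging gives about 0.867 while micro and weighted averaging both give about 0.833; macro treats the single-sample class as equal to the others.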
Regression Objectives¶
Coefficient of determination for regression. |
|
Mean absolute error for regression. |
|
Mean absolute percentage error for time series regression. Scaled by 100 to return a percentage. |
|
Mean squared error for regression. |
|
Mean squared log error for regression. |
|
Median absolute error for regression. |
|
Maximum residual error for regression. |
|
Explained variance score for regression. |
|
Root mean squared error for regression. |
|
Root mean squared log error for regression. |
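For reference, several of the regression objectives above reduce to a few lines of arithmetic. The sketch below uses only the standard library; the function name is hypothetical, and the MAPE formula assumes no true value is zero.

```python
import math


def regression_scores(y_true, y_pred):
    """Compute a handful of common regression metrics from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1 - ss_res / ss_tot,  # coefficient of determination
        "mape": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "max_error": max(abs(e) for e in errors),
    }


scores = regression_scores(y_true=[1, 2, 3, 4], y_pred=[1, 2, 3, 5])
```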
Objective Utils¶
Get a list of the names of all objectives. |
|
Returns all core objective instances associated with the given problem type. |
|
Get a list of all valid core objectives. |
|
Get non-core objective classes. |
|
Returns the Objective class corresponding to a given objective name. |
Problem Types¶
Handles problem_type by either returning the ProblemTypes or converting from a str. |
|
Determine the type of problem being solved based on the targets (binary vs multiclass classification, regression). Ignores missing and null data. |
|
Enum defining the supported types of machine learning problems. |
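The "return the enum or convert from a str" behaviour described above is a common pattern, sketched here with a stand-in enum. `ProblemKind`, its three members, and `handle_problem_kind` are hypothetical names for this example; they do not reproduce EvalML's enum or its full list of problem types.

```python
from enum import Enum


class ProblemKind(Enum):
    """Stand-in enum with a few illustrative problem types."""

    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


def handle_problem_kind(value):
    """Return value unchanged if it is already an enum member, else parse a string."""
    if isinstance(value, ProblemKind):
        return value
    if isinstance(value, str):
        try:
            return ProblemKind(value.lower())  # lookup by value, case-insensitive
        except ValueError:
            raise ValueError(f"unknown problem type: {value!r}") from None
    raise TypeError(f"expected str or ProblemKind, got {type(value).__name__}")


parsed = handle_problem_kind("Binary")
passed_through = handle_problem_kind(ProblemKind.REGRESSION)
```

Accepting both forms lets callers pass plain strings while the rest of the code works only with enum members.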
Model Family¶
Handles model_family by either returning the ModelFamily or converting from a string. |
|
Enum for family of machine learning models. |
Tuners¶
Base Tuner class. |
|
Bayesian Optimizer. |
|
Grid Search Optimizer, which generates all of the possible points to search for using a grid. |
|
Random Search Optimizer. |
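Generating "all of the possible points" for a grid search amounts to taking the Cartesian product of the candidate values for each parameter. A minimal sketch, with a hypothetical `grid_points` helper and made-up parameter names:

```python
import itertools


def grid_points(search_space):
    """Yield one dict per combination of the candidate values."""
    names = list(search_space)
    for combination in itertools.product(*(search_space[name] for name in names)):
        yield dict(zip(names, combination))


points = list(grid_points({"n_estimators": [10, 100], "max_depth": [3, 5, 7]}))
```

The number of points is the product of the list lengths (2 x 3 = 6 here), which is why grid search becomes expensive as parameters are added.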
Data Checks¶
Data Check Classes¶
Base class for all data checks. |
|
Check if the target data contains missing or invalid values. |
|
Check if there are any highly-null columns and rows in the input. |
|
Check if any of the features are likely to be ID columns. |
|
Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation. |
|
Checks if there are any outliers in input data by using IQR to determine score anomalies. |
|
Check if the target or any of the features have no variance. |
|
Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems. |
|
Check if any set of features is likely to be multicollinear. |
|
Checks each column in the input for datetime features and issues an error if NaN values are present. |
|
Checks each column in the input for natural language features and issues an error if NaN values are present. |
|
Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators. |
|
A collection of data checks. |
|
A collection of basic data checks that is used by AutoML by default. |
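As an example of what one of these checks does, the IQR-based outlier check can be sketched as follows: compute the first and third quartiles, then flag values more than `k` interquartile ranges outside them. The function name, the conventional `k=1.5` multiplier, and the "inclusive" quartile method are assumptions of this sketch, not necessarily what EvalML uses.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Q1 = 3, Q3 = 7, so the accepted range is [-3, 13] and only 100 is flagged.
flagged = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```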
Data Check Messages¶
Base class for a message returned by a DataCheck, tagged by name. |
|
DataCheckMessage subclass for errors returned by data checks. |
|
DataCheckMessage subclass for warnings returned by data checks. |
Data Check Message Types¶
Enum for type of data check message: WARNING or ERROR. |
Data Check Message Codes¶
Enum for data check message code. |
Utils¶
General Utils¶
Attempts to import the requested library by name. If the import fails, raises an ImportError or warning. |
|
Converts a string describing a length of time to its length in seconds. |
|
Generates a numpy.random.RandomState instance using seed. |
|
Given a numpy.random.RandomState object, generate an int representing a seed value for another random number generator. Or, if given an int, return that int. |
|
Pad the beginning with num_to_pad rows of NaNs. |
|
Drop rows that have any NaNs in all dataframes or series. |
|
Create a Woodwork structure from the given list, pandas, or numpy input, with specified types for columns. If a column’s type is not specified, it will be inferred by Woodwork. |
|
Saves fig to filepath if specified, or to a default location if not. |
|
Checks if the given DataFrame contains only numeric values. |
|
Get importable subclasses of a base class. Used to list all of our estimators, transformers, components and pipelines dynamically. |
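Converting a string describing a length of time into seconds, one of the utilities above, can be sketched with a small unit table. The accepted unit spellings, the function name, and the error behaviour are all assumptions for this example and may differ from what EvalML accepts.

```python
import re

# Hypothetical set of accepted units, mapped to their length in seconds.
_UNIT_SECONDS = {
    "s": 1, "sec": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hr": 3600, "hour": 3600, "hours": 3600,
}


def time_string_to_seconds(text):
    """Parse strings such as '10 minutes' or '2h' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse time string: {text!r}")
    amount, unit = float(match.group(1)), match.group(2).lower()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return amount * _UNIT_SECONDS[unit]


ten_minutes = time_string_to_seconds("10 minutes")
two_hours = time_string_to_seconds("2h")
```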