API Reference#

Demo Datasets#

`load_breast_cancer`	Load breast cancer dataset. Binary classification problem.
`load_diabetes`	Load diabetes dataset. Used for regression problem.
`load_fraud`	Load credit card fraud dataset.
`load_wine`	Load wine dataset. Multiclass problem.
`load_churn`	Load churn dataset, which can be used for binary classification problems.

Preprocessing#

Utilities to preprocess data before using EvalML.

`load_data`	Load features and target from file.
`target_distribution`	Get the target distributions.
`number_of_features`	Get the number of features of each specific dtype in a DataFrame.
`split_data`	Split data into train and test sets.

Exceptions#

`MethodPropertyNotFoundError`	Exception to raise when a class does not have an expected method or property.
`PipelineNotFoundError`	An exception raised when a particular pipeline is not found in automl search results.
`ObjectiveNotFoundError`	Exception to raise when specified objective does not exist.
`MissingComponentError`	An exception raised when a component is not found in all_components().
`ComponentNotYetFittedError`	An exception to be raised when predict/predict_proba/transform is called on a component without fitting first.
`PipelineNotYetFittedError`	An exception to be raised when predict/predict_proba/transform is called on a pipeline without fitting first.
`AutoMLSearchException`	Exception raised when all pipelines in an automl batch return a score of NaN for the primary objective.
`PipelineScoreError`	An exception raised when a pipeline errors while scoring any objective in a list of objectives.
`DataCheckInitError`	Exception raised when a data check can't initialize with the parameters given.
`NullsInColumnWarning`	Warning thrown when there are null values in the column of interest.

AutoML#

AutoML Search Interface#

`AutoMLSearch`	Automated pipeline search.

AutoML Utils#

`search`	Given data and configuration, run an automl search.
`get_default_primary_search_objective`	Get the default primary search objective for a problem type.
`make_data_splitter`	Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.

AutoML Algorithm Classes#

`AutoMLAlgorithm`	Base class for the AutoML algorithms which power EvalML.
`IterativeAlgorithm`	An automl algorithm which first fits a base round of pipelines with default parameters, then does a round of parameter tuning on each pipeline in order of performance.

AutoML Callbacks#

`silent_error_callback`	No-op.
`log_error_callback`	Logs the exception thrown as an error.
`raise_error_callback`	Raises the exception thrown by the AutoMLSearch object.
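
The three callbacks above differ only in what they do with an exception raised during a search. The following is a minimal sketch of that pattern, not EvalML's implementation: the `_sketch` function names, the `(exception, context)` signature, and the `run_with_callback` helper are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("automl_sketch")


def silent_error_callback_sketch(exception, context=None):
    """No-op: ignore the exception so the search can continue."""
    return None


def log_error_callback_sketch(exception, context=None):
    """Log the exception as an error, then continue."""
    logger.error("Pipeline failed (%s): %r", context, exception)


def raise_error_callback_sketch(exception, context=None):
    """Re-raise the exception, which stops the search."""
    raise exception


def run_with_callback(step, error_callback, context=None):
    """Hypothetical helper: run `step` and hand any exception to the callback."""
    try:
        return step()
    except Exception as exc:
        error_callback(exc, context)
        return None


def failing_step():
    """Stand-in for a pipeline evaluation that fails."""
    raise ValueError("bad pipeline")
```

With the silent or logging variant, `run_with_callback(failing_step, ...)` returns `None` and execution continues; with the raising variant the `ValueError` propagates to the caller.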

AutoML Engines#

`SequentialEngine`	The default engine for the AutoML search.
`CFEngine`	The concurrent.futures (CF) engine.
`DaskEngine`	The dask engine.

Pipelines#

Pipeline Base Classes#

`PipelineBase`	Machine learning pipeline.
`ClassificationPipeline`	Pipeline subclass for all classification pipelines.
`BinaryClassificationPipeline`	Pipeline subclass for all binary classification pipelines.
`MulticlassClassificationPipeline`	Pipeline subclass for all multiclass classification pipelines.
`RegressionPipeline`	Pipeline subclass for all regression pipelines.
`TimeSeriesClassificationPipeline`	Pipeline base class for time series classification problems.
`TimeSeriesBinaryClassificationPipeline`	Pipeline base class for time series binary classification problems.
`TimeSeriesMulticlassClassificationPipeline`	Pipeline base class for time series multiclass classification problems.
`TimeSeriesRegressionPipeline`	Pipeline base class for time series regression problems.

Pipeline Utils#

`make_pipeline`	Given input data, target data, an estimator class and the problem type, generates a pipeline class with a preprocessing chain which was recommended based on the inputs. The pipeline will be a subclass of the appropriate pipeline base class for the specified problem_type.
`generate_pipeline_code`	Creates and returns a string that contains the Python imports and code required for running the EvalML pipeline.
`rows_of_interest`	Get the row indices of the data that are closest to the threshold. Works only for binary classification problems and pipelines.
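
To illustrate what "closest to the threshold" can mean for `rows_of_interest`, here is a small sketch in plain Python. It is not the library's code: the function name, the `epsilon` cutoff, and the nearest-first ordering are assumptions chosen for the example.

```python
def rows_closest_to_threshold(probabilities, threshold=0.5, epsilon=0.1):
    """Return row indices whose positive-class probability lies within
    `epsilon` of `threshold`, ordered from nearest to farthest."""
    distances = [(abs(p - threshold), i) for i, p in enumerate(probabilities)]
    return [i for distance, i in sorted(distances) if distance <= epsilon]


# Hypothetical predicted probabilities for five rows.
example_rows = rows_closest_to_threshold([0.05, 0.48, 0.9, 0.55, 0.62])
```

Rows like these are the ones whose predicted class would flip under a small change in the decision threshold, which is why they are worth inspecting.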

Component Graphs#

`ComponentGraph`	Component graph for a pipeline as a directed acyclic graph (DAG).
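
As a rough illustration of why a DAG is a convenient representation, the sketch below stores a hypothetical graph as a mapping from each component name to the components it depends on, then derives an evaluation order with the standard library's `graphlib` (Python 3.9+). The dictionary layout and the component wiring are assumptions for this example, not the format `ComponentGraph` itself accepts.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each key is a component, each value lists its parents,
# i.e. the components whose output it consumes.
component_parents = {
    "Imputer": [],
    "One Hot Encoder": ["Imputer"],
    "Standard Scaler": ["Imputer"],
    "Logistic Regression Classifier": ["One Hot Encoder", "Standard Scaler"],
}


def compute_order(parents):
    """Return an order in which every component comes after all of its parents.

    Raises graphlib.CycleError if the mapping is not acyclic.
    """
    return list(TopologicalSorter(parents).static_order())


order = compute_order(component_parents)
```

The two middle components may appear in either order, since neither depends on the other; only the parent-before-child constraint is guaranteed.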

Components#

Component Base Classes#

Components represent a step in a pipeline.

`ComponentBase`	Base class for all components.
`Transformer`	A component that may or may not need fitting that transforms data. These components are used before an estimator.
`Estimator`	A component that fits and predicts given data.

Component Utils#

`allowed_model_families`	List the model types allowed for a particular problem type.
`get_estimators`	Returns the estimators allowed for a particular problem type.
`generate_component_code`	Creates and returns a string that contains the Python imports and code required for running the EvalML component.

Transformers#

Transformers are components that take in data as input and output transformed data.

`DropColumns`	Drops specified columns in input data.
`SelectColumns`	Selects specified columns in input data.
`SelectByType`	Selects columns by specified Woodwork logical type or semantic tag in input data.
`OneHotEncoder`	A transformer that encodes categorical features in a one-hot numeric array.
`TargetEncoder`	A transformer that encodes categorical features into target encodings.
`PerColumnImputer`	Imputes missing data according to a specified imputation strategy per column.
`Imputer`	Imputes missing data according to a specified imputation strategy.
`SimpleImputer`	Imputes missing data according to a specified imputation strategy. Natural language columns are ignored.
`StandardScaler`	A transformer that standardizes input features by removing the mean and scaling to unit variance.
`RFRegressorSelectFromModel`	Selects top features based on importance weights using a Random Forest regressor.
`RFClassifierSelectFromModel`	Selects top features based on importance weights using a Random Forest classifier.
`DropNullColumns`	Transformer to drop features whose percentage of NaN values exceeds a specified threshold.
`DateTimeFeaturizer`	Transformer that can automatically extract features from datetime columns.
`NaturalLanguageFeaturizer`	Transformer that can automatically featurize text columns using featuretools' nlp_primitives.
`TimeSeriesFeaturizer`	Transformer that delays input features and target variable for time series problems.
`DFSTransformer`	Featuretools DFS component that generates features for the input features.
`PolynomialDetrender`	Removes trends from time series by fitting a polynomial to the data.
`Undersampler`	Initializes an undersampling transformer to downsample the majority classes in the dataset.
`Oversampler`	SMOTE Oversampler component. Will automatically select whether to use SMOTE, SMOTEN, or SMOTENC based on inputs to the component.
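
The `Oversampler` entry mentions automatic selection among SMOTE, SMOTEN, and SMOTENC. The sketch below shows one way such a choice can be made from column types, following the usual division of labor among the three variants (continuous only, nominal only, mixed). The function name and the boolean-flag input are hypothetical; the component's real selection logic may differ.

```python
def choose_smote_variant(is_categorical):
    """Pick a SMOTE variant from per-column flags (True = categorical column)."""
    flags = list(is_categorical)
    if not flags:
        raise ValueError("at least one column is required")
    if all(flags):
        return "SMOTEN"   # nominal features only
    if any(flags):
        return "SMOTENC"  # mix of nominal and continuous features
    return "SMOTE"        # continuous features only
```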

Estimators#

Classifiers#

Classifiers are components that output a predicted class label.

`CatBoostClassifier`	CatBoost Classifier, a classifier that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features.
`ElasticNetClassifier`	Elastic Net Classifier. Uses Logistic Regression with elasticnet penalty as the base estimator.
`ExtraTreesClassifier`	Extra Trees Classifier.
`RandomForestClassifier`	Random Forest Classifier.
`LightGBMClassifier`	LightGBM Classifier.
`LogisticRegressionClassifier`	Logistic Regression Classifier.
`XGBoostClassifier`	XGBoost Classifier.
`BaselineClassifier`	Classifier that predicts using the specified strategy.
`StackedEnsembleClassifier`	Stacked Ensemble Classifier.
`DecisionTreeClassifier`	Decision Tree Classifier.
`KNeighborsClassifier`	K-Nearest Neighbors Classifier.
`SVMClassifier`	Support Vector Machine Classifier.
`VowpalWabbitBinaryClassifier`	Vowpal Wabbit Binary Classifier.
`VowpalWabbitMulticlassClassifier`	Vowpal Wabbit Multiclass Classifier.

Regressors#

Regressors are components that output a predicted target value.

`ARIMARegressor`	Autoregressive Integrated Moving Average Model. The three parameters (p, d, q) are the AR order, the degree of differencing, and the MA order. More information here: https://www.statsmodels.org/devel/generated/statsmodels.tsa.arima.model.ARIMA.html.
`CatBoostRegressor`	CatBoost Regressor, a regressor that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features.
`ElasticNetRegressor`	Elastic Net Regressor.
`ExponentialSmoothingRegressor`	Holt-Winters Exponential Smoothing Forecaster.
`LinearRegressor`	Linear Regressor.
`ExtraTreesRegressor`	Extra Trees Regressor.
`RandomForestRegressor`	Random Forest Regressor.
`XGBoostRegressor`	XGBoost Regressor.
`BaselineRegressor`	Baseline regressor that uses a simple strategy to make predictions. This is useful as a simple baseline regressor to compare with other regressors.
`TimeSeriesBaselineEstimator`	Time series estimator that predicts using the naive forecasting approach.
`StackedEnsembleRegressor`	Stacked Ensemble Regressor.
`DecisionTreeRegressor`	Decision Tree Regressor.
`LightGBMRegressor`	LightGBM Regressor.
`SVMRegressor`	Support Vector Machine Regressor.
`VowpalWabbitRegressor`	Vowpal Wabbit Regressor.

Model Understanding#

Utility Methods#

`confusion_matrix`	Confusion matrix for binary and multiclass classification.
`normalize_confusion_matrix`	Normalizes a confusion matrix.
`precision_recall_curve`	Given labels and binary classifier predicted probabilities, compute and return the data representing a precision-recall curve.
`roc_curve`	Given labels and classifier predicted probabilities, compute and return the data representing a Receiver Operating Characteristic (ROC) curve. Works with binary or multiclass problems.
`calculate_permutation_importance`	Calculates permutation importance for features.
`calculate_permutation_importance_one_column`	Calculates permutation importance for one column in the original dataframe.
`binary_objective_vs_threshold`	Computes objective score as a function of potential binary classification decision thresholds for a fitted binary classification pipeline.
`get_prediction_vs_actual_over_time_data`	Get the data needed for the prediction_vs_actual_over_time plot.
`partial_dependence`	Calculates one or two-way partial dependence.
`get_prediction_vs_actual_data`	Combines y_true and y_pred into a single dataframe and adds a column for outliers. Used in graph_prediction_vs_actual().
`get_linear_coefficients`	Returns a dataframe showing the features with the greatest predictive power for a linear model.
`t_sne`	Get the transformed output after fitting X to the embedded space using t-SNE.
`find_confusion_matrix_per_thresholds`	Gets the confusion matrix and histogram bins for each threshold as well as the best threshold per objective. Only works with Binary Classification Pipelines.
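
For readers unfamiliar with the first two utilities in this table, here is a plain-Python sketch of building a confusion matrix and normalizing it. It assumes rows hold true labels and columns hold predicted labels, and it normalizes over rows only; both are assumptions for the example rather than statements about the defaults of `confusion_matrix` or `normalize_confusion_matrix`.

```python
def confusion_matrix_sketch(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its sum so that each non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


counts = confusion_matrix_sketch([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], labels=[0, 1])
rates = normalize_rows(counts)
```

After row normalization, each diagonal entry is the recall of the corresponding class.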

Graph Utility Methods#

`graph_precision_recall_curve`	Generate and display a precision-recall plot.
`graph_roc_curve`	Generate and display a Receiver Operating Characteristic (ROC) plot for binary and multiclass classification problems.
`graph_confusion_matrix`	Generate and display a confusion matrix plot.
`graph_permutation_importance`	Generate a bar graph of the pipeline's permutation importance.
`graph_binary_objective_vs_threshold`	Generates a plot graphing objective score vs. decision thresholds for a fitted binary classification pipeline.
`graph_prediction_vs_actual`	Generate a scatter plot comparing the true and predicted values. Used for regression plotting.
`graph_prediction_vs_actual_over_time`	Plot the target values and predictions against time on the x-axis.
`graph_partial_dependence`	Create a one-way or two-way partial dependence plot.
`graph_t_sne`	Plot high dimensional data into lower dimensional space using t-SNE.

Prediction Explanations#

`explain_predictions`	Creates a report summarizing the top contributing features for each data point in the input features.
`explain_predictions_best_worst`	Creates a report summarizing the top contributing features for the best and worst points in the dataset as measured by error to true labels.

Objectives#

Objective Base Classes#

`ObjectiveBase`	Base class for all objectives.
`BinaryClassificationObjective`	Base class for all binary classification objectives.
`MulticlassClassificationObjective`	Base class for all multiclass classification objectives.
`RegressionObjective`	Base class for all regression objectives.

Domain-Specific Objectives#

`FraudCost`	Score the percentage of the total transaction amount that is lost to fraud.
`LeadScoring`	Lead scoring.
`CostBenefitMatrix`	Score using a cost-benefit matrix. Scores quantify the benefits of a given value, so greater numeric scores represent a better score. Costs and scores can be negative, indicating that a value is not beneficial. For example, in the case of monetary profit, a negative cost and/or score represents loss of cash flow.
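
The idea behind a cost-benefit matrix can be shown with a short sketch: assign a payoff to each of the four binary outcomes and average the payoffs over all predictions. The function below is illustrative only. Its name, its parameter names, and the choice to average per prediction are assumptions, not a description of how `CostBenefitMatrix` computes its score.

```python
def cost_benefit_score(y_true, y_pred, true_positive, true_negative,
                       false_positive, false_negative):
    """Average payoff per prediction for binary labels (1 = positive class)."""
    payoff = {
        (1, 1): true_positive,
        (0, 0): true_negative,
        (0, 1): false_positive,   # predicted positive, actually negative
        (1, 0): false_negative,   # predicted negative, actually positive
    }
    total = sum(payoff[(t, p)] for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# One of each outcome: payoffs 10, -2, 0 and -5 sum to 3 over four predictions.
example_score = cost_benefit_score(
    y_true=[1, 0, 0, 1],
    y_pred=[1, 1, 0, 0],
    true_positive=10, true_negative=0, false_positive=-2, false_negative=-5,
)
```

Negative payoffs model losses, so a classifier that makes many costly errors can end up with a negative score overall.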

Classification Objectives#

`AccuracyBinary`	Accuracy score for binary classification.
`AccuracyMulticlass`	Accuracy score for multiclass classification.
`AUC`	AUC score for binary classification.
`AUCMacro`	AUC score for multiclass classification using macro averaging.
`AUCMicro`	AUC score for multiclass classification using micro averaging.
`AUCWeighted`	AUC score for multiclass classification using weighted averaging.
`Gini`	Gini coefficient for binary classification.
`BalancedAccuracyBinary`	Balanced accuracy score for binary classification.
`BalancedAccuracyMulticlass`	Balanced accuracy score for multiclass classification.
`F1`	F1 score for binary classification.
`F1Micro`	F1 score for multiclass classification using micro averaging.
`F1Macro`	F1 score for multiclass classification using macro averaging.
`F1Weighted`	F1 score for multiclass classification using weighted averaging.
`LogLossBinary`	Log Loss for binary classification.
`LogLossMulticlass`	Log Loss for multiclass classification.
`MCCBinary`	Matthews correlation coefficient for binary classification.
`MCCMulticlass`	Matthews correlation coefficient for multiclass classification.
`Precision`	Precision score for binary classification.
`PrecisionMicro`	Precision score for multiclass classification using micro averaging.
`PrecisionMacro`	Precision score for multiclass classification using macro averaging.
`PrecisionWeighted`	Precision score for multiclass classification using weighted averaging.
`Recall`	Recall score for binary classification.
`RecallMicro`	Recall score for multiclass classification using micro averaging.
`RecallMacro`	Recall score for multiclass classification using macro averaging.
`RecallWeighted`	Recall score for multiclass classification using weighted averaging.
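
Several objectives above come in micro, macro, and weighted variants. The sketch below computes all three for F1 in plain Python so the difference is concrete: macro takes an unweighted mean of per-class scores, weighted takes a mean weighted by each class's number of true samples, and micro pools true positives, false positives, and false negatives across classes before computing a single score. The function names are made up for this example.

```python
from collections import Counter


def f1_for_label(y_true, y_pred, label):
    """One-vs-rest F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_averages(y_true, y_pred):
    """Return micro, macro, and weighted F1 for single-label multiclass data."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {label: f1_for_label(y_true, y_pred, label) for label in labels}

    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)

    # Pooled counts: every wrong prediction is one false positive (for the
    # predicted class) and one false negative (for the true class).
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    wrong = len(y_true) - correct
    micro = 2 * correct / (2 * correct + 2 * wrong)

    return {"micro": micro, "macro": macro, "weighted": weighted}


scores = f1_averages([0, 0, 0, 0, 1, 1, 2], [0, 0, 0, 1, 1, 2, 2])
```

On imbalanced data the three averages can differ noticeably, because macro averaging gives rare classes the same weight as common ones.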

Regression Objectives#

`R2`	Coefficient of determination for regression.
`MAE`	Mean absolute error for regression.
`MAPE`	Mean absolute percentage error for time series regression. Scaled by 100 to return a percentage.
`MSE`	Mean squared error for regression.
`MeanSquaredLogError`	Mean squared log error for regression.
`MedianAE`	Median absolute error for regression.
`MaxError`	Maximum residual error for regression.
`ExpVariance`	Explained variance score for regression.
`RootMeanSquaredError`	Root mean squared error for regression.
`RootMeanSquaredLogError`	Root mean squared log error for regression.
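
For reference, a few of the regression objectives above can be written out directly from their definitions. The functions below are a plain-Python sketch, not the library's classes, and they assume equal-length, non-empty inputs.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error, scaled by 100.

    Undefined when any true value is zero, since the error is divided by it.
    """
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [100, 200, 400]
predicted = [110, 180, 400]
```

MAE and RMSE are in the units of the target, while MAPE is relative to each true value, which makes it comparable across series of different scales.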

Objective Utils#

`get_all_objective_names`	Get a list of the names of all objectives.
`get_core_objectives`	Returns all core objective instances associated with the given problem type.
`get_core_objective_names`	Get a list of all valid core objectives.
`get_non_core_objectives`	Get non-core objective classes.
`get_objective`	Returns the Objective class corresponding to a given objective name.

Problem Types#

`handle_problem_types`	Handles problem_type by either returning the ProblemTypes or converting from a str.
`detect_problem_type`	Determine the type of problem being solved based on the targets (binary vs multiclass classification, regression). Ignores missing and null data.
`ProblemTypes`	Enum defining the supported types of machine learning problems.

Model Family#

`handle_model_family`	Handles model_family by either returning the ModelFamily or converting from a string.
`ModelFamily`	Enum for family of machine learning models.

Tuners#

`Tuner`	Base Tuner class.
`SKOptTuner`	Bayesian Optimizer.
`GridSearchTuner`	Grid Search Optimizer, which generates all of the possible points to search for using a grid.
`RandomSearchTuner`	Random Search Optimizer.
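
The grid search description, "generates all of the possible points to search for using a grid", can be sketched with `itertools.product`. The search-space format used here (a tuple for a numeric range, a list for categorical choices) and the `n_points` parameter are assumptions for the example, not the tuner's actual interface.

```python
from itertools import product


def evenly_spaced(low, high, n):
    """Return n evenly spaced values from low to high inclusive."""
    if n == 1:
        return [float(low)]
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def grid_points(space, n_points=3):
    """Enumerate every combination of parameter values on the grid."""
    axes = []
    for spec in space.values():
        if isinstance(spec, tuple):      # numeric range: (low, high)
            axes.append(evenly_spaced(spec[0], spec[1], n_points))
        else:                            # explicit list of choices
            axes.append(list(spec))
    return [dict(zip(space, combo)) for combo in product(*axes)]


# Hypothetical search space with one numeric and one categorical parameter.
points = grid_points({"max_depth": (2, 6), "criterion": ["gini", "entropy"]})
```

The number of points is the product of the axis sizes, so grids grow quickly with the number of parameters; this is the usual argument for random or Bayesian search on larger spaces.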

Data Checks#

Data Check Classes#

`DataCheck`	Base class for all data checks.
`InvalidTargetDataCheck`	Check if the target data is considered invalid.
`NullDataCheck`	Check if there are any highly-null numerical, boolean, categorical, natural language, and unknown columns and rows in the input.
`IDColumnsDataCheck`	Check if any of the features are likely to be ID columns.
`TargetLeakageDataCheck`	Check if any of the features are highly correlated with the target by using mutual information or Pearson correlation.
`OutliersDataCheck`	Checks if there are any outliers in input data by using IQR to determine score anomalies.
`NoVarianceDataCheck`	Check if the target or any of the features have no variance.
`ClassImbalanceDataCheck`	Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems.
`MulticollinearityDataCheck`	Check if any set of features is likely to be multicollinear.
`DateTimeFormatDataCheck`	Check if the datetime column has equally spaced intervals and is monotonically increasing or decreasing in order to be supported by time series estimators.
`TimeSeriesParametersDataCheck`	Checks whether the time series parameters are compatible with data splitting.
`TimeSeriesSplittingDataCheck`	Checks whether the time series target data is compatible with splitting.
`DataChecks`	A collection of data checks.
`DefaultDataChecks`	A collection of basic data checks that is used by AutoML by default.
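
One of the rules described above, that each target class should have at least 2 times the number of CV folds in samples, is simple enough to sketch directly. The function below is illustrative and covers only that count rule. Its name and return format are assumptions, and it does not reproduce the messages that `ClassImbalanceDataCheck` returns.

```python
from collections import Counter


def classes_below_fold_minimum(y, num_cv_folds=3):
    """Return the labels that have fewer than 2 * num_cv_folds samples."""
    minimum = 2 * num_cv_folds
    counts = Counter(y)
    return sorted(label for label, count in counts.items() if count < minimum)


# With 3 folds the minimum is 6 samples per class, so only "b" is flagged.
target = ["a"] * 10 + ["b"] * 5 + ["c"] * 6
flagged = classes_below_fold_minimum(target, num_cv_folds=3)
```

The motivation for such a rule is that very small classes may be missing from some cross-validation folds, which makes the resulting scores unreliable.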

Data Check Messages#

`DataCheckMessage`	Base class for a message returned by a DataCheck, tagged by name.
`DataCheckError`	DataCheckMessage subclass for errors returned by data checks.
`DataCheckWarning`	DataCheckMessage subclass for warnings returned by data checks.

Data Check Message Types#

`DataCheckMessageType`	Enum for type of data check message: WARNING or ERROR.

Data Check Message Codes#

`DataCheckMessageCode`	Enum for data check message code.

Utils#

General Utils#

`import_or_raise`	Attempts to import the requested library by name. If the import fails, raises an ImportError or warning.
`convert_to_seconds`	Converts a string describing a length of time to its length in seconds.
`get_random_state`	Generates a numpy.random.RandomState instance using seed.
`get_random_seed`	Given a numpy.random.RandomState object, generate an int representing a seed value for another random number generator. Or, if given an int, return that int.
`pad_with_nans`	Pad the beginning num_to_pad rows with nans.
`drop_rows_with_nans`	Drop rows that have any NaNs in all dataframes or series.
`infer_feature_types`	Create a Woodwork structure from the given list, pandas, or numpy input, with specified types for columns. If a column's type is not specified, it will be inferred by Woodwork.
`save_plot`	Saves fig to filepath if specified, or to a default location if not.
`is_all_numeric`	Checks if the given DataFrame contains only numeric values.
`get_importable_subclasses`	Get importable subclasses of a base class. Used to list all of our estimators, transformers, components and pipelines dynamically.
