Pipelines#

EvalML pipelines.

Package Contents#

Classes Summary#

`ARIMARegressor`	Autoregressive Integrated Moving Average Model. The three parameters (p, d, q) are the AR order, the degree of differencing, and the MA order. More information here: https://www.statsmodels.org/devel/generated/statsmodels.tsa.arima.model.ARIMA.html.
`BinaryClassificationPipeline`	Pipeline subclass for all binary classification pipelines.
`CatBoostClassifier`	CatBoost Classifier, a classifier that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features.
`CatBoostRegressor`	CatBoost Regressor, a regressor that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features.
`ClassificationPipeline`	Pipeline subclass for all classification pipelines.
`ComponentGraph`	Component graph for a pipeline as a directed acyclic graph (DAG).
`DecisionTreeClassifier`	Decision Tree Classifier.
`DecisionTreeRegressor`	Decision Tree Regressor.
`DFSTransformer`	Featuretools DFS component that generates features for the input features.
`DropNaNRowsTransformer`	Transformer to drop rows with NaN values.
`ElasticNetClassifier`	Elastic Net Classifier. Uses Logistic Regression with elasticnet penalty as the base estimator.
`ElasticNetRegressor`	Elastic Net Regressor.
`Estimator`	A component that fits and predicts given data.
`ExponentialSmoothingRegressor`	Holt-Winters Exponential Smoothing Forecaster.
`ExtraTreesClassifier`	Extra Trees Classifier.
`ExtraTreesRegressor`	Extra Trees Regressor.
`FeatureSelector`	Selects top features based on importance weights.
`Imputer`	Imputes missing data according to a specified imputation strategy.
`KNeighborsClassifier`	K-Nearest Neighbors Classifier.
`LightGBMClassifier`	LightGBM Classifier.
`LightGBMRegressor`	LightGBM Regressor.
`LinearRegressor`	Linear Regressor.
`LogisticRegressionClassifier`	Logistic Regression Classifier.
`MulticlassClassificationPipeline`	Pipeline subclass for all multiclass classification pipelines.
`MultiseriesRegressionPipeline`	Pipeline base class for multiseries time series regression problems.
`OneHotEncoder`	A transformer that encodes categorical features in a one-hot numeric array.
`OrdinalEncoder`	A transformer that encodes ordinal features as an array of ordinal integers representing the relative order of categories.
`PerColumnImputer`	Imputes missing data according to a specified imputation strategy per column.
`PipelineBase`	Machine learning pipeline.
`ProphetRegressor`	Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
`RandomForestClassifier`	Random Forest Classifier.
`RandomForestRegressor`	Random Forest Regressor.
`RegressionPipeline`	Pipeline subclass for all regression pipelines.
`RFClassifierSelectFromModel`	Selects top features based on importance weights using a Random Forest classifier.
`RFRegressorSelectFromModel`	Selects top features based on importance weights using a Random Forest regressor.
`SimpleImputer`	Imputes missing data according to a specified imputation strategy. Natural language columns are ignored.
`StackedEnsembleBase`	Stacked Ensemble Base Class.
`StackedEnsembleClassifier`	Stacked Ensemble Classifier.
`StackedEnsembleRegressor`	Stacked Ensemble Regressor.
`StandardScaler`	A transformer that standardizes input features by removing the mean and scaling to unit variance.
`SVMClassifier`	Support Vector Machine Classifier.
`SVMRegressor`	Support Vector Machine Regressor.
`TargetEncoder`	A transformer that encodes categorical features into target encodings.
`TimeSeriesBinaryClassificationPipeline`	Pipeline base class for time series binary classification problems.
`TimeSeriesClassificationPipeline`	Pipeline base class for time series classification problems.
`TimeSeriesFeaturizer`	Transformer that delays input features and target variable for time series problems.
`TimeSeriesImputer`	Imputes missing data according to a specified timeseries-specific imputation strategy.
`TimeSeriesMulticlassClassificationPipeline`	Pipeline base class for time series multiclass classification problems.
`TimeSeriesRegressionPipeline`	Pipeline base class for time series regression problems.
`TimeSeriesRegularizer`	Transformer that regularizes an inconsistently spaced datetime column.
`Transformer`	A component that may or may not need fitting that transforms data. These components are used before an estimator.
`VARMAXRegressor`	Vector Autoregressive Moving Average with eXogenous regressors model. The two parameters (p, q) are the AR order and the MA order. More information here: https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.varmax.VARMAX.html.
`XGBoostClassifier`	XGBoost Classifier.
`XGBoostRegressor`	XGBoost Regressor.

Contents#

class evalml.pipelines.ARIMARegressor(time_index: Optional[Hashable] = None, trend: Optional[str] = None, start_p: int = 2, d: int = 0, start_q: int = 2, max_p: int = 5, max_d: int = 2, max_q: int = 5, seasonal: bool = True, sp: int = 1, n_jobs: int = -1, random_seed: Union[int, float] = 0, maxiter: int = 10, use_covariates: bool = True, **kwargs)[source]#

Autoregressive Integrated Moving Average Model. The three parameters (p, d, q) are the AR order, the degree of differencing, and the MA order. More information here: https://www.statsmodels.org/devel/generated/statsmodels.tsa.arima.model.ARIMA.html.

Currently ARIMARegressor isn’t supported via conda install. It’s recommended that it be installed via PyPI.

Parameters

time_index (str) – Specifies the name of the column in X that provides the datetime objects. Defaults to None.
trend (str) – Controls the deterministic trend. Options are [‘n’, ‘c’, ‘t’, ‘ct’] where ‘c’ is a constant term, ‘t’ indicates a linear trend, and ‘ct’ is both. Can also be an iterable when defining a polynomial, such as [1, 1, 0, 1].
start_p (int) – Minimum Autoregressive order. Defaults to 2.
d (int) – Minimum Differencing degree. Defaults to 0.
start_q (int) – Minimum Moving Average order. Defaults to 2.
max_p (int) – Maximum Autoregressive order. Defaults to 5.
max_d (int) – Maximum Differencing degree. Defaults to 2.
max_q (int) – Maximum Moving Average order. Defaults to 5.
seasonal (boolean) – Whether to fit a seasonal model to ARIMA. Defaults to True.
sp (int or str) – Period for seasonal differencing, specifically the number of periods in each season. If “detect”, this model will automatically detect this parameter (given the time series is a standard frequency) and will fall back to 1 (no seasonality) if it cannot be detected. Defaults to 1.
n_jobs (int or None) – Integer describing the level of parallelism used for pipelines. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “start_p”: Integer(1, 3), “d”: Integer(0, 2), “start_q”: Integer(1, 3), “max_p”: Integer(3, 10), “max_d”: Integer(2, 5), “max_q”: Integer(3, 10), “seasonal”: [True, False],}
max_cols	7
max_rows	1000
model_family	ModelFamily.ARIMA
modifies_features	True
modifies_target	False
name	ARIMA Regressor
supported_problem_types	[ProblemTypes.TIME_SERIES_REGRESSION]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns array of 0's with a length of 1 as feature_importance is not defined for ARIMA regressor.
`fit`	Fits ARIMA regressor to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted ARIMARegressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using fitted ARIMA regressor.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → numpy.ndarray#: Returns array of 0’s with a length of 1 as feature_importance is not defined for ARIMA regressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)[source]#

Fits ARIMA regressor to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

Raises

ValueError – If y was not passed in.

get_prediction_intervals(self, X: pandas.DataFrame, y: pandas.Series = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series][source]#

Find the prediction intervals using the fitted ARIMARegressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Optional.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Not used for ARIMA regressor.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None) → pandas.Series[source]#

Make predictions using fitted ARIMA regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data.

Returns

Predicted values.

Return type

pd.Series

Raises

ValueError – If X was passed to fit but not passed in predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.BinaryClassificationPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline subclass for all binary classification pipelines.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”].
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
custom_name (str) – Custom name for the pipeline. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.
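
The duplicate-naming rule described for component_graph can be illustrated with a minimal sketch in plain Python. The helper name resolve_component_names is hypothetical and is not part of EvalML; it only mirrors the documented behavior of appending the component's 0-based list index to a repeated name.

```python
def resolve_component_names(components):
    """Return unique names, suffixing repeated names with their list index."""
    seen = set()
    names = []
    for index, name in enumerate(components):
        if name in seen:
            # A repeated component gets its 0-based position appended.
            name = f"{name}_{index}"
        seen.add(name)
        names.append(name)
    return names


names = resolve_component_names(
    ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
)
print(names)
```

With the list from the parameter description, the second Imputer sits at index 2, which is why it is reported as “Imputer_2”.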

Example

>>> from evalml.pipelines import BinaryClassificationPipeline
>>> pipeline = BinaryClassificationPipeline(component_graph=["Simple Imputer", "Logistic Regression Classifier"],
...                                         parameters={"Logistic Regression Classifier": {"penalty": "elasticnet",
...                                                                                        "solver": "liblinear"}},
...                                         custom_name="My Binary Pipeline")
...
>>> assert pipeline.custom_name == "My Binary Pipeline"
>>> assert pipeline.component_graph.component_dict.keys() == {'Simple Imputer', 'Logistic Regression Classifier'}

The pipeline parameters will be chosen from the default parameters for every component, unless specific parameters were passed in as they were above.

>>> assert pipeline.parameters == {
...     'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None},
...     'Logistic Regression Classifier': {'penalty': 'elasticnet',
...                                        'C': 1.0,
...                                        'n_jobs': -1,
...                                        'multi_class': 'auto',
...                                        'solver': 'liblinear'}}
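
The defaulting rule used in the example above (component defaults, overridden only where parameters were passed) can be sketched as a plain dictionary merge. This is an illustration rather than EvalML's implementation: resolve_parameters is a hypothetical helper, and the default values shown for "penalty" and "solver" are assumptions that do not appear in this page.

```python
# Assumed defaults for illustration; "penalty" and "solver" defaults are assumptions.
DEFAULTS = {
    "Simple Imputer": {"impute_strategy": "most_frequent", "fill_value": None},
    "Logistic Regression Classifier": {
        "penalty": "l2",
        "C": 1.0,
        "n_jobs": -1,
        "multi_class": "auto",
        "solver": "lbfgs",
    },
}


def resolve_parameters(defaults, overrides=None):
    """Merge per-component overrides on top of per-component defaults."""
    overrides = overrides or {}  # None or {} means "use all defaults"
    return {
        name: {**params, **overrides.get(name, {})}
        for name, params in defaults.items()
    }


resolved = resolve_parameters(
    DEFAULTS,
    {"Logistic Regression Classifier": {"penalty": "elasticnet", "solver": "liblinear"}},
)
print(resolved)
```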

Attributes

problem_type

ProblemTypes.BINARY

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`classes_`	Gets the class names for the pipeline. Will return None before pipeline is fit.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`optimize_threshold`	Optimize the pipeline threshold given the objective to use. Only used for binary problems with objectives whose thresholds can be tuned.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels. Assumes that the column at index 1 represents the positive label case.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`threshold`	Threshold used to make a prediction. Defaults to None.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

property classes_(self)#: Gets the class names for the pipeline. Will return None before pipeline is fit.

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)#

Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.
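
The label mapping described above can be sketched in plain Python. The function name encode_labels is hypothetical; the sketch only shows how sorted(set(y)) yields the class order and how each class is mapped to an integer between 0 and n_classes-1.

```python
def encode_labels(y):
    """Map each label to its position in sorted(set(y))."""
    classes = sorted(set(y))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in y], classes


encoded, classes = encode_labels(["spam", "ham", "spam", "eggs"])
print(classes)  # class order used for the mapping
print(encoded)  # integer codes in the original row order
```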

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training labels of length [n_samples]

Returns

self

Raises

ValueError – If the number of unique classes in y is not appropriate for the type of pipeline.
TypeError – If the dtype is boolean but pd.NA exists in the series.
Exception – For all other exceptions.

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: The component with the given name.
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

x_edges specifies from which component feature data is being passed. y_edges specifies from which component target data is being passed. This can be used to build graphs across a variety of visualization tools. Template: {“Nodes”: {“component_name”: {“Name”: class_name, “Parameters”: parameters_attributes}, …}, “x_edges”: [[from_component_name, to_component_name], [from_component_name, to_component_name], …], “y_edges”: [[from_component_name, to_component_name], [from_component_name, to_component_name], …]}

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)
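
As a rough illustration of the graph_dict structure, the sketch below builds a dictionary by hand in the documented shape, serializes it to JSON, and derives an adjacency list from x_edges. The node names, parameter values, and the "X" and "y" source nodes are assumptions chosen for the example, not output captured from a real pipeline.

```python
import json

# Hand-written dictionary following the documented template (values are illustrative).
dag_dict = {
    "Nodes": {
        "Simple Imputer": {
            "Name": "Simple Imputer",
            "Parameters": {"impute_strategy": "most_frequent", "fill_value": None},
        },
        "Logistic Regression Classifier": {
            "Name": "Logistic Regression Classifier",
            "Parameters": {"penalty": "elasticnet", "C": 1.0},
        },
    },
    "x_edges": [
        ["X", "Simple Imputer"],
        ["Simple Imputer", "Logistic Regression Classifier"],
    ],
    "y_edges": [
        ["y", "Simple Imputer"],
        ["y", "Logistic Regression Classifier"],
    ],
}

# The structure contains only dicts, lists, strings, numbers and None,
# so it survives a JSON round trip unchanged.
serialized = json.dumps(dag_dict)
restored = json.loads(serialized)

# Build "who feeds features to whom" from the feature edges.
feature_successors = {}
for source, target in restored["x_edges"]:
    feature_successors.setdefault(source, []).append(target)
print(feature_successors)
```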

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.
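
The filtering implied by importance_threshold can be sketched as follows. This is not the plotting code itself: select_features_to_plot is a hypothetical helper, and treating a negative threshold as the invalid case and using a strict comparison are both assumptions based only on the wording above.

```python
def select_features_to_plot(importances, importance_threshold=0):
    """Keep features whose absolute importance exceeds the threshold."""
    if importance_threshold < 0:
        # Assumption: a negative threshold is what counts as "not valid".
        raise ValueError("importance_threshold must be non-negative")
    return {
        feature: value
        for feature, value in importances.items()
        if abs(value) > importance_threshold
    }


importances = {"age": 0.42, "income": -0.30, "zip_code": 0.01}
selected = select_features_to_plot(importances, importance_threshold=0.05)
print(selected)
```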

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – File path to load from, or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

optimize_threshold(self, X, y, y_pred_proba, objective)#

Optimize the pipeline threshold given the objective to use. Only used for binary problems with objectives whose thresholds can be tuned.

Parameters

X (pd.DataFrame) – Input features.
y (pd.Series) – Input target values.
y_pred_proba (pd.Series) – The predicted probabilities of the target outputted by the pipeline.
objective (ObjectiveBase) – The objective to threshold with. Must have a tunable threshold.

Raises

ValueError – If objective is not optimizable.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)#

Make predictions using selected features.

Note: predictions are cast to integers first. Calculating predictions may return boolean values, which could not otherwise be transformed back if the original targets were integers.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Estimated labels.

Return type

pd.Series

predict_proba(self, X, X_train=None, y_train=None)[source]#

Make probability estimates for labels. Assumes that the column at index 1 represents the positive label case.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features]
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored. Only used for time series.
y_train (pd.Series or None) – Training labels. Ignored. Only used for time series.

Returns

Probability estimates

Return type

pd.DataFrame
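
To show how the positive-class column and the pipeline threshold relate, here is a minimal sketch in plain Python. The helper labels_from_probabilities is hypothetical, and falling back to a cutoff of 0.5 when the threshold is None is an assumption, not documented behavior.

```python
def labels_from_probabilities(probabilities, threshold=None):
    """Turn rows of [p_negative, p_positive] into boolean predictions."""
    # Assumption: with no tuned threshold, use the conventional 0.5 cutoff.
    cutoff = 0.5 if threshold is None else threshold
    # Column index 1 holds the probability of the positive label.
    return [row[1] > cutoff for row in probabilities]


probabilities = [[0.80, 0.20], [0.35, 0.65], [0.60, 0.40]]
default_labels = labels_from_probabilities(probabilities)
tuned_labels = labels_from_probabilities(probabilities, threshold=0.3)
print(default_labels, tuned_labels)
```

Lowering the threshold makes the positive label more likely to be predicted, which is the lever that optimize_threshold adjusts for a given objective.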

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)#

Evaluate model performance on objectives.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features]
y (pd.Series) – True labels of length [n_samples]
objectives (list) – List of objectives to score
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.
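
The format of the summary string in the example can be reproduced with a short sketch. The function summarize is hypothetical and assumes that the last component in the list is the estimator and that every preceding component is a transformer.

```python
def summarize(component_names):
    """Build an 'Estimator w/ Transformer + Transformer' style summary."""
    *transformers, estimator = component_names
    if not transformers:
        return estimator
    return f"{estimator} w/ " + " + ".join(transformers)


summary = summarize(
    ["Simple Imputer", "One Hot Encoder", "Logistic Regression Classifier"]
)
print(summary)
```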

property threshold(self)#: Threshold used to make a prediction. Defaults to None.

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series or None) – Targets corresponding to X. Optional.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Only used for time series.
y_train (pd.Series or None) – Training labels. Only used for time series.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.CatBoostClassifier(n_estimators=10, eta=0.03, max_depth=6, bootstrap_type=None, silent=True, allow_writing_files=False, random_seed=0, n_jobs=-1, **kwargs)[source]#

CatBoost Classifier, a classifier that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features.

For more information, check out https://catboost.ai/

Parameters

n_estimators (int) – The maximum number of trees to build. Defaults to 10.
eta (float) – The learning rate. Defaults to 0.03.
max_depth (int) – The maximum tree depth for base learners. Defaults to 6.
bootstrap_type (string) – Defines the method for sampling the weights of objects. Available methods are ‘Bayesian’, ‘Bernoulli’, ‘MVS’. Defaults to None.
silent (boolean) – Whether to use the “silent” logging mode. Defaults to True.
allow_writing_files (boolean) – Whether to allow writing snapshot files while training. Defaults to False.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “n_estimators”: Integer(4, 100), “eta”: Real(0.000001, 1), “max_depth”: Integer(4, 10),}
model_family	ModelFamily.CATBOOST
modifies_features	True
modifies_target	False
name	CatBoost Classifier
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance of fitted CatBoost classifier.
`fit`	Fits CatBoost classifier component to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using the fitted CatBoost classifier.
`predict_proba`	Make prediction probabilities using the fitted CatBoost classifier.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Feature importance of fitted CatBoost classifier.

fit(self, X, y=None)[source]#

Fits CatBoost classifier component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

This function takes the predictions of the fitted estimator and calculates the rolling standard deviation across all predictions using a window size of 5. The lower and upper predictions are determined by taking the percent point (quantile) function of the lower tail probability at each bound multiplied by the rolling standard deviation.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
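
The interval calculation described for get_prediction_intervals can be sketched with the standard library alone. This is an approximation of the idea rather than the library's code: the function names are hypothetical, and the use of a standard normal distribution, the sample standard deviation, and back-filling of positions that lack a full window are all assumptions.

```python
from statistics import NormalDist, stdev


def rolling_std(values, window=5):
    """Sample standard deviation over a trailing window of the given size."""
    stds = [None] * len(values)
    for end in range(window, len(values) + 1):
        stds[end - 1] = stdev(values[end - window:end])
    # Assumption: positions without a full window reuse the first available value.
    first = next((s for s in stds if s is not None), 0.0)
    return [first if s is None else s for s in stds]


def prediction_intervals(predictions, coverage):
    """Return {coverage}_lower and {coverage}_upper bounds for each level."""
    stds = rolling_std(predictions)
    intervals = {}
    for level in coverage:
        # Quantiles at the lower and upper tail probabilities of each bound.
        z_lower = NormalDist().inv_cdf((1 - level) / 2)
        z_upper = NormalDist().inv_cdf((1 + level) / 2)
        intervals[f"{level}_lower"] = [p + z_lower * s for p, s in zip(predictions, stds)]
        intervals[f"{level}_upper"] = [p + z_upper * s for p, s in zip(predictions, stds)]
    return intervals


predictions = [10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.5]
intervals = prediction_intervals(predictions, coverage=[0.95])
print(sorted(intervals))
```

Each coverage value must lie strictly between 0 and 1 in this sketch, because the normal quantile function is undefined at the endpoints.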

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X)[source]#

Make predictions using the fitted CatBoost classifier.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series

predict_proba(self, X)[source]#

Make prediction probabilities using the fitted CatBoost classifier.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted probability values.
Return type: pd.DataFrame

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.CatBoostRegressor(n_estimators=10, eta=0.03, max_depth=6, bootstrap_type=None, silent=False, allow_writing_files=False, random_seed=0, n_jobs=-1, **kwargs)[source]#

CatBoost Regressor, a regressor that uses gradient-boosting on decision trees. CatBoost is an open-source library and natively supports categorical features.

For more information, check out https://catboost.ai/

Parameters

n_estimators (int) – The maximum number of trees to build. Defaults to 10.
eta (float) – The learning rate. Defaults to 0.03.
max_depth (int) – The maximum tree depth for base learners. Defaults to 6.
bootstrap_type (string) – Defines the method for sampling the weights of objects. Available methods are ‘Bayesian’, ‘Bernoulli’, ‘MVS’. Defaults to None.
silent (boolean) – Whether to use the “silent” logging mode. Defaults to False.
allow_writing_files (boolean) – Whether to allow writing snapshot files while training. Defaults to False.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “n_estimators”: Integer(4, 100), “eta”: Real(0.000001, 1), “max_depth”: Integer(4, 10),}
model_family	ModelFamily.CATBOOST
modifies_features	True
modifies_target	False
name	CatBoost Regressor
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION, ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance of fitted CatBoost regressor.
`fit`	Fits CatBoost regressor component to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using the fitted CatBoost regressor.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict
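
The parameter conventions above (defaults matching a freshly constructed component, `clone` copying parameters and seed, and `update_parameters` resetting the fitted flag) can be illustrated with a toy class. `SketchComponent` is a hypothetical stand-in written for this example, not an EvalML class, and it exposes `default_parameters` as a classmethod call rather than as the class-level attribute the convention is written with.

```python
import copy


class SketchComponent:
    """Hypothetical stand-in for a component; not part of EvalML."""

    def __init__(self, n_estimators=10, eta=0.03, max_depth=6, random_seed=0):
        self._parameters = {
            "n_estimators": n_estimators,
            "eta": eta,
            "max_depth": max_depth,
        }
        self.random_seed = random_seed
        self._is_fitted = False

    @property
    def parameters(self):
        # Return a copy so callers cannot mutate internal state by accident.
        return copy.deepcopy(self._parameters)

    @classmethod
    def default_parameters(cls):
        # The convention: defaults are whatever a no-argument instance reports.
        return cls().parameters

    def clone(self):
        # New, unfitted instance with the same parameters and seed.
        return type(self)(**self.parameters, random_seed=self.random_seed)

    def update_parameters(self, update_dict, reset_fit=True):
        self._parameters.update(update_dict)
        if reset_fit:
            self._is_fitted = False


tuned = SketchComponent(max_depth=8, random_seed=42)
duplicate = tuned.clone()

tuned._is_fitted = True  # pretend the component was fitted
tuned.update_parameters({"eta": 0.1})
```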

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component. Defaults to False.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}. Defaults to False.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Feature importance of fitted CatBoost regressor.

fit(self, X, y=None)[source]#

Fits CatBoost regressor component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X)[source]#

Make predictions using the fitted CatBoost regressor.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.ClassificationPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline subclass for all classification pipelines.

Parameters

component_graph (list or dict) – List of components in order. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”]
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
custom_name (str) – Custom name for the pipeline. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.
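
The renaming of duplicate components described for component_graph can be sketched as follows. `name_components` is a hypothetical helper, and the rule it applies (append the component's zero-based list index to every repeat) is inferred from the single example given above rather than taken from the library source.

```python
def name_components(component_names):
    """Make names unique by suffixing repeats with their index in the list."""
    seen = set()
    unique_names = []
    for index, name in enumerate(component_names):
        unique_names.append(f"{name}_{index}" if name in seen else name)
        seen.add(name)
    return unique_names


graph = ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
names = name_components(graph)

# The parameters dictionary is keyed by these unique names.
parameters = {
    "Imputer": {"numeric_impute_strategy": "mean"},
    "Imputer_2": {"numeric_impute_strategy": "median"},
}
```

The renamed entry is the key to use in the parameters dictionary when the two copies need different settings.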

Attributes

problem_type

None

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`classes_`	Gets the class names for the pipeline. Will return None before pipeline is fit.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

property classes_(self)#: Gets the class names for the pipeline. Will return None before pipeline is fit.

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)[source]#

Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training labels of length [n_samples]

Returns

self

Raises

ValueError – If the number of unique classes in y is not appropriate for the type of pipeline.
TypeError – If the dtype is boolean but pd.NA exists in the series.
Exception – For all other exceptions.
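
The class mapping mentioned for string and categorical targets can be sketched in a few lines. `encode_labels` is a hypothetical helper used only to show the `sorted(set(y))` ordering; it is not the pipeline's own encoder.

```python
def encode_labels(y):
    """Map each class to an integer in [0, n_classes - 1] using sorted order."""
    classes = sorted(set(y))
    mapping = {label: code for code, label in enumerate(classes)}
    return classes, [mapping[label] for label in y]


y = ["spam", "ham", "spam", "eggs"]
classes, encoded = encode_labels(y)

# Decoding reverses the mapping by indexing into the sorted class list.
decoded = [classes[code] for code in encoded]
```

Because the order comes from sorting, the integer assigned to a class does not depend on where it first appears in the target.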

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: The component with the given name.
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dict

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.
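
The thresholding behind this graph can be sketched without plotting. `select_features_to_graph` is a hypothetical helper; treating a negative threshold as invalid and using a strict comparison are assumptions based on the wording above, not confirmed library behavior.

```python
def select_features_to_graph(importances, importance_threshold=0):
    """Keep features whose absolute importance exceeds the threshold."""
    if importance_threshold < 0:
        # Assumption: a negative threshold is what counts as "not valid".
        raise ValueError("importance_threshold must be non-negative")
    kept = {
        name: value
        for name, value in importances.items()
        if abs(value) > importance_threshold
    }
    # Largest absolute importance first, as a bar graph would usually show it.
    return dict(sorted(kept.items(), key=lambda item: abs(item[1]), reverse=True))


importances = {"zip": 0.01, "income": -0.25, "age": 0.40}
to_graph = select_features_to_graph(importances, importance_threshold=0.05)
```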

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – load filepath or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)[source]#

Make predictions using selected features.

Note: predictions are cast to ints first so that boolean values returned by the estimator, which could not otherwise be transformed when the original targets were integers, are handled correctly.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Estimated labels.

Return type

pd.Series

predict_proba(self, X, X_train=None, y_train=None)[source]#

Make probability estimates for labels.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features]
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored. Only used for time series.
y_train (pd.Series or None) – Training labels. Ignored. Only used for time series.

Returns

Probability estimates

Return type

pd.DataFrame

Raises

ValueError – If final component is not an estimator.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)[source]#

Evaluate model performance on objectives.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features]
y (pd.Series) – True labels of length [n_samples]
objectives (list) – List of objectives to score
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.
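
The format of the example string can be reproduced with a short helper. `summarize` is hypothetical and assumes a non-empty, linear list of component names whose last entry is the estimator; how the library treats other graph shapes is not covered here.

```python
def summarize(component_names):
    """Build '<estimator> w/ <transformer> + <transformer>' from ordered names."""
    *transformers, estimator = component_names
    if not transformers:
        return estimator
    return f"{estimator} w/ {' + '.join(transformers)}"


summary = summarize(["Simple Imputer", "One Hot Encoder", "Logistic Regression Classifier"])
print(summary)
```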

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series or None) – Targets corresponding to X. Optional.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Only used for time series.
y_train (pd.Series or None) – Training labels. Only used for time series.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.ComponentGraph(component_dict=None, cached_data=None, random_seed=0)[source]#

Component graph for a pipeline as a directed acyclic graph (DAG).

Parameters

component_dict (dict) – A dictionary which specifies the components and edges between components that should be used to create the component graph. Defaults to None.
cached_data (dict) – A dictionary of nested cached data. If the hashes and components are in this cache, we skip fitting for these components. Expected to be of format {hash1: {component_name: trained_component, …}, hash2: {…}, …}. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Examples

>>> component_dict = {'Imputer': ['Imputer', 'X', 'y'],
...                   'Logistic Regression': ['Logistic Regression Classifier', 'Imputer.x', 'y']}
>>> component_graph = ComponentGraph(component_dict)
>>> assert component_graph.compute_order == ['Imputer', 'Logistic Regression']
...
...
>>> component_dict = {'Imputer': ['Imputer', 'X', 'y'],
...                   'OHE': ['One Hot Encoder', 'Imputer.x', 'y'],
...                   'estimator_1': ['Random Forest Classifier', 'OHE.x', 'y'],
...                   'estimator_2': ['Decision Tree Classifier', 'OHE.x', 'y'],
...                   'final': ['Logistic Regression Classifier', 'estimator_1.x', 'estimator_2.x', 'y']}
>>> component_graph = ComponentGraph(component_dict)

The default parameters for every component in the component graph.

>>> assert component_graph.default_parameters == {
...     'Imputer': {'categorical_impute_strategy': 'most_frequent',
...                 'numeric_impute_strategy': 'mean',
...                 'boolean_impute_strategy': 'most_frequent',
...                 'categorical_fill_value': None,
...                 'numeric_fill_value': None,
...                 'boolean_fill_value': None},
...     'One Hot Encoder': {'top_n': 10,
...                         'features_to_encode': None,
...                         'categories': None,
...                         'drop': 'if_binary',
...                         'handle_unknown': 'ignore',
...                         'handle_missing': 'error'},
...     'Random Forest Classifier': {'n_estimators': 100,
...                                  'max_depth': 6,
...                                  'n_jobs': -1},
...     'Decision Tree Classifier': {'criterion': 'gini',
...                                  'max_features': 'sqrt',
...                                  'max_depth': 6,
...                                  'min_samples_split': 2,
...                                  'min_weight_fraction_leaf': 0.0},
...     'Logistic Regression Classifier': {'penalty': 'l2',
...                                        'C': 1.0,
...                                        'n_jobs': -1,
...                                        'multi_class': 'auto',
...                                        'solver': 'lbfgs'}}

Methods

`compute_order`	The order that components will be computed or called in.
`default_parameters`	The default parameter dictionary for this pipeline.
`describe`	Outputs component graph details including component parameters.
`fit`	Fit each component in the graph.
`fit_and_transform_all_but_final`	Fit and transform all components save the final one, usually an estimator.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`generate_order`	Regenerates the topologically sorted order of the graph.
`get_component`	Retrieves a single component object from the graph.
`get_component_input_logical_types`	Get the logical types that are passed to the given component.
`get_estimators`	Gets a list of all the estimator components within this graph.
`get_inputs`	Retrieves all inputs for a given component.
`get_last_component`	Retrieves the component that is computed last in the graph, usually the final estimator.
`graph`	Generate an image representing the component graph.
`has_dfs`	Whether this component graph contains a DFSTransformer or not.
`instantiate`	Instantiates all uninstantiated components within the graph using the given parameters. An error will be raised if a component is already instantiated but the parameters dict contains arguments for that component.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`last_component_input_logical_types`	Get the logical types that are passed to the last component in the pipeline.
`predict`	Make predictions using selected features.
`transform`	Transform the input using the component graph.
`transform_all_but_final`	Transform all components save the final one, and gather the data from any number of parents to get all the information that should be fed to the final component.

property compute_order(self)#: The order that components will be computed or called in.

property default_parameters(self)#

The default parameter dictionary for this pipeline.

Returns: Dictionary of all component default parameters.
Return type: dict

describe(self, return_dict=False)[source]#

Outputs component graph details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about component graph. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None
Return type: dict
Raises: ValueError – If the component graph is not instantiated.

fit(self, X, y)[source]#

Fit each component in the graph.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

fit_and_transform_all_but_final(self, X, y)[source]#

Fit and transform all components save the final one, usually an estimator.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

Transformed features and target.

Return type

Tuple (pd.DataFrame, pd.Series)

fit_transform(self, X, y)[source]#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

classmethod generate_order(cls, component_dict)[source]#: Regenerates the topologically sorted order of the graph.
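
A topological ordering of a component_dict in the format shown in the Examples can be sketched with the standard library. `sketch_compute_order` is a hypothetical helper, not the method's implementation; it assumes every input other than 'X' and 'y' has the form '<component name>.x' or '<component name>.y'.

```python
from graphlib import TopologicalSorter


def sketch_compute_order(component_dict):
    """Order component names so that every parent precedes its children."""
    dependencies = {}
    for name, (_component, *inputs) in component_dict.items():
        # 'Imputer.x' means this node consumes the output of 'Imputer'.
        dependencies[name] = {
            inp.rsplit(".", 1)[0] for inp in inputs if inp not in ("X", "y")
        }
    return list(TopologicalSorter(dependencies).static_order())


component_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "OHE": ["One Hot Encoder", "Imputer.x", "y"],
    "estimator_1": ["Random Forest Classifier", "OHE.x", "y"],
    "estimator_2": ["Decision Tree Classifier", "OHE.x", "y"],
    "final": ["Logistic Regression Classifier", "estimator_1.x", "estimator_2.x", "y"],
}
order = sketch_compute_order(component_dict)
```

The two estimators have no dependency on each other, so either may come first; only their position relative to 'OHE' and 'final' is fixed.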

get_component(self, component_name)[source]#

Retrieves a single component object from the graph.

Parameters: component_name (str) – Name of the component to retrieve
Returns: ComponentBase object
Raises: ValueError – If the component is not in the graph.

get_component_input_logical_types(self, component_name)[source]#

Get the logical types that are passed to the given component.

Parameters

component_name (str) – Name of component in the graph

Returns

Dict - Mapping feature name to logical type instance.

Raises

ValueError – If the component is not in the graph.
ValueError – If the component graph has not been fitted.

get_estimators(self)[source]#

Gets a list of all the estimator components within this graph.

Returns: All estimator objects within the graph.
Return type: list
Raises: ValueError – If the component graph is not yet instantiated.

get_inputs(self, component_name)[source]#

Retrieves all inputs for a given component.

Parameters: component_name (str) – Name of the component to look up.
Returns: List of inputs for the component to use.
Return type: list[str]
Raises: ValueError – If the component is not in the graph.

get_last_component(self)[source]#

Retrieves the component that is computed last in the graph, usually the final estimator.

Returns: ComponentBase object
Raises: ValueError – If the component graph has no edges.

graph(self, name=None, graph_format=None)[source]#

Generate an image representing the component graph.

Parameters

name (str) – Name of the graph. Defaults to None.
graph_format (str) – File format to save the graph in. Defaults to None.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.

property has_dfs(self)#: Whether this component graph contains a DFSTransformer or not.

instantiate(self, parameters=None)[source]#

Instantiates all uninstantiated components within the graph using the given parameters. An error will be raised if a component is already instantiated but the parameters dict contains arguments for that component.

Parameters: parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary {} or None implies using all default values for component parameters. If a component in the component graph is already instantiated, it will not use any of its parameters defined in this dictionary. Defaults to None.
Returns: self
Raises: ValueError – If component graph is already instantiated or if a component errored while instantiating.

inverse_transform(self, y)[source]#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The target with inverse transformation applied.
Return type: pd.Series
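
Applying inverse transforms in reverse order can be sketched with toy components. The `Sketch*` classes and `inverse_transform_all` are hypothetical and operate on plain lists; skipping components that lack an inverse_transform method is an assumption made for this illustration.

```python
import math


class SketchShift:
    """Toy target transformer: adds a constant offset."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, y):
        return [v + self.offset for v in y]

    def inverse_transform(self, y):
        return [v - self.offset for v in y]


class SketchLog:
    """Toy target transformer: natural logarithm."""

    def transform(self, y):
        return [math.log(v) for v in y]

    def inverse_transform(self, y):
        return [math.exp(v) for v in y]


class SketchPassThrough:
    """Toy component without an inverse_transform method."""

    def transform(self, y):
        return list(y)


def inverse_transform_all(components, y):
    """Undo target transformations, last-applied first."""
    for component in reversed(components):
        inverse = getattr(component, "inverse_transform", None)
        if inverse is not None:
            y = inverse(y)
    return y


components = [SketchShift(1.0), SketchPassThrough(), SketchLog()]
target = [0.0, 1.0, 9.0]

transformed = target
for component in components:
    transformed = component.transform(transformed)

recovered = inverse_transform_all(components, transformed)
```

Reversing the order matters here: undoing the shift before the logarithm would not recover the original target.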

property last_component_input_logical_types(self)#

Get the logical types that are passed to the last component in the pipeline.

Returns

Dict - Mapping feature name to logical type instance.

Raises

ValueError – If the component is not in the graph.
ValueError – If the component graph has not been fitted.

predict(self, X)[source]#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Input features of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: ValueError – If final component is not an Estimator.

transform(self, X, y=None)[source]#

Transform the input using the component graph.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is not a Transformer.

transform_all_but_final(self, X, y=None)[source]#

Transform all components save the final one, and gather the data from any number of parents to get all the information that should be fed to the final component.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples]. Defaults to None.

Returns

Transformed values.

Return type

pd.DataFrame

class evalml.pipelines.DecisionTreeClassifier(criterion='gini', max_features='sqrt', max_depth=6, min_samples_split=2, min_weight_fraction_leaf=0.0, random_seed=0, **kwargs)[source]#

Decision Tree Classifier.

Parameters

criterion ({"gini", "entropy"}) – The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Defaults to “gini”.
max_features (int, float or {"sqrt", "log2"}) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features = n_features.
The search for a split does not stop until at least one valid partition of the node samples is found, even if that requires inspecting more than max_features features. Defaults to “sqrt”.
max_depth (int) – The maximum depth of the tree. Defaults to 6.
min_samples_split (int or float) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Defaults to 2.
min_weight_fraction_leaf (float) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Defaults to 0.0.
random_seed (int) – Seed for the random number generator. Defaults to 0.
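
The rules for max_features and min_samples_split listed above can be written out as small resolver functions. Both helpers are hypothetical and follow the wording of the parameter descriptions; truncating the square root and logarithm to integers and clamping the result to at least 1 are assumptions, not documented behavior.

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features considered at each split."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features == "sqrt":
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"Unknown max_features value: {max_features!r}")
    if isinstance(max_features, int):
        return max_features
    # A float is treated as a fraction of the available features.
    return max(1, int(max_features * n_features))


def resolve_min_samples_split(min_samples_split, n_samples):
    """Minimum number of samples needed to split an internal node."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    # A float is treated as a fraction of the samples, rounded up.
    return math.ceil(min_samples_split * n_samples)


features_per_split = resolve_max_features("sqrt", n_features=16)
samples_per_split = resolve_min_samples_split(0.1, n_samples=95)
```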

Attributes

hyperparameter_ranges	{ “criterion”: [“gini”, “entropy”], “max_features”: [“sqrt”, “log2”], “max_depth”: Integer(4, 10),}
model_family	ModelFamily.DECISION_TREE
modifies_features	True
modifies_target	False
name	Decision Tree Classifier
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component. Defaults to False.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}. Defaults to False.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict
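
A minimal sketch of the return contract, using a hypothetical `ToyComponent` rather than an evalml class. Only the dictionary format and the `None` fallback come from the description above; the printing behaviour is an assumption.

```python
class ToyComponent:
    """Hypothetical component; mirrors only the describe() return contract."""

    name = "Toy Component"

    def __init__(self, max_depth=6):
        self.parameters = {"max_depth": max_depth}

    def describe(self, print_name=False, return_dict=False):
        if print_name:
            print(self.name)
        if return_dict:
            # Format taken from the parameter description above.
            return {"name": self.name, "parameters": self.parameters}
        return None


description = ToyComponent(max_depth=4).describe(return_dict=True)
nothing = ToyComponent().describe()
```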

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of coverage levels, each a float between 0 and 1, for which the upper and lower bounds of the prediction interval should be calculated.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
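
The key naming of the returned dictionary can be sketched as follows. `toy_prediction_intervals` and its `half_widths` argument are hypothetical. The fixed symmetric half-width is a placeholder and not how evalml computes intervals. Only the `{coverage}_lower` / `{coverage}_upper` key format comes from the description above.

```python
def toy_prediction_intervals(predictions, coverage, half_widths):
    """Build a dict keyed as '{coverage}_lower' / '{coverage}_upper'.

    half_widths maps each coverage level to a fixed interval half-width;
    this is a placeholder for a real interval calculation.
    """
    intervals = {}
    for level in coverage:
        width = half_widths[level]
        intervals[f"{level}_lower"] = [p - width for p in predictions]
        intervals[f"{level}_upper"] = [p + width for p in predictions]
    return intervals


intervals = toy_prediction_intervals(
    predictions=[10.0, 12.0],
    coverage=[0.8, 0.95],
    half_widths={0.8: 1.0, 0.95: 2.0},
)
```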

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.
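
A save and load round trip might look like the sketch below. It makes several assumptions: `ToyComponent` is hypothetical, the standard-library `pickle` module stands in for `cloudpickle`, and only the parameters dictionary is stored instead of the whole component, to keep the example small.

```python
import os
import pickle
import tempfile


class ToyComponent:
    """Hypothetical component with a pickle-based save/load pair."""

    def __init__(self, max_depth=6):
        self.parameters = {"max_depth": max_depth}

    def save(self, file_path, pickle_protocol=pickle.DEFAULT_PROTOCOL):
        # Simplification: persist only the parameters, not the object itself.
        with open(file_path, "wb") as f:
            pickle.dump({"parameters": self.parameters}, f, protocol=pickle_protocol)

    @staticmethod
    def load(file_path):
        with open(file_path, "rb") as f:
            state = pickle.load(f)
        return ToyComponent(**state["parameters"])


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "component.pkl")
    ToyComponent(max_depth=4).save(path)
    restored = ToyComponent.load(path)
```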

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
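
The effect of `reset_fit` can be sketched with a hypothetical `ToyComponent`. The plain `dict.update` call is an assumption about how the update is applied. The `_is_fitted` flag is the one named in the parameter description above.

```python
class ToyComponent:
    """Hypothetical component tracking a fitted flag."""

    def __init__(self, max_depth=6):
        self._parameters = {"max_depth": max_depth}
        self._is_fitted = False

    def fit(self):
        self._is_fitted = True
        return self

    def update_parameters(self, update_dict, reset_fit=True):
        # Assumption: new values simply overwrite the stored ones.
        self._parameters.update(update_dict)
        if reset_fit:
            # A changed parameter invalidates any previous fit.
            self._is_fitted = False


reset = ToyComponent().fit()
reset.update_parameters({"max_depth": 8})

kept = ToyComponent().fit()
kept.update_parameters({"max_depth": 8}, reset_fit=False)
```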

class evalml.pipelines.DecisionTreeRegressor(criterion='squared_error', max_features='sqrt', max_depth=6, min_samples_split=2, min_weight_fraction_leaf=0.0, random_seed=0, **kwargs)[source]#

Decision Tree Regressor.

Parameters

criterion ({"squared_error", "friedman_mse", "absolute_error", "poisson"}) –
The function to measure the quality of a split. Supported criteria are:
- “squared_error” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node,
- “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits,
- “absolute_error” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node,
- “poisson”, which uses reduction in Poisson deviance to find splits.
max_features (int, float or {"sqrt", "log2"}) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features = n_features.
The search for a split does not stop until at least one valid partition of the node samples is found, even if that requires effectively inspecting more than max_features features.
max_depth (int) – The maximum depth of the tree. Defaults to 6.
min_samples_split (int or float) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Defaults to 2.
min_weight_fraction_leaf (float) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Defaults to 0.0.
random_seed (int) – Seed for the random number generator. Defaults to 0.
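
The int, float, and string forms of max_features and min_samples_split described above resolve to concrete counts roughly as in this sketch. The helper names are hypothetical, and the `max(1, ...)` guard is an assumption added so that at least one feature is always considered. It is not the estimator's actual code.

```python
import math


def resolve_max_features(max_features, n_features):
    """Turn a max_features setting into a number of features per split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    # Float: treated as a fraction of the available features.
    return max(1, int(max_features * n_features))


def resolve_min_samples_split(min_samples_split, n_samples):
    """Turn a min_samples_split setting into a minimum sample count."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    # Float: treated as a fraction of the samples, rounded up.
    return math.ceil(min_samples_split * n_samples)
```

For example, with 16 features both “sqrt” and “log2” resolve to 4 features per split, and a min_samples_split of 0.25 on 10 samples resolves to 3.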

Attributes

hyperparameter_ranges	{ “criterion”: [“squared_error”, “friedman_mse”, “absolute_error”], “max_features”: [“sqrt”, “log2”], “max_depth”: Integer(4, 10),}
model_family	ModelFamily.DECISION_TREE
modifies_features	True
modifies_target	False
name	Decision Tree Regressor
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION, ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of coverage levels, each a float between 0 and 1, for which the upper and lower bounds of the prediction interval should be calculated.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.DFSTransformer(index='index', features=None, random_seed=0, **kwargs)[source]#

Featuretools DFS component that generates features for the input features.

Parameters

index (string) – The name of the column that contains the indices. If no column with this name exists, then featuretools.EntitySet() creates a column with this name to serve as the index column. Defaults to ‘index’.
random_seed (int) – Seed for the random number generator. Defaults to 0.
features (list) – List of features to run DFS on. Defaults to None. A feature is only computed if the columns it uses exist in the input and the feature itself is not already present in the input. If features is an empty list, the input data is left untransformed.
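
The selection rule for the features parameter can be sketched in plain Python. This is an illustration only: real features are featuretools objects, whereas here each feature is a hypothetical `(name, base_columns)` pair, and `features_to_compute` is not an evalml function.

```python
def features_to_compute(features, input_columns):
    """Return names of features whose inputs exist and that are not yet present."""
    available = set(input_columns)
    selected = []
    for name, base_columns in features:
        inputs_available = set(base_columns) <= available
        already_present = name in available
        if inputs_available and not already_present:
            selected.append(name)
    return selected


# Hypothetical feature definitions: (feature name, columns it is built from).
features = [
    ("ABSOLUTE(amount)", ["amount"]),
    ("DAY(timestamp)", ["timestamp"]),
    ("MONTH(timestamp)", ["timestamp"]),
    ("NUM_WORDS(comment)", ["comment"]),
]
input_columns = ["amount", "timestamp", "DAY(timestamp)"]

selected = features_to_compute(features, input_columns)
nothing_selected = features_to_compute([], input_columns)
```

Here `DAY(timestamp)` is skipped because it already exists in the input, and `NUM_WORDS(comment)` is skipped because its source column is missing.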

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	False
name	DFS Transformer
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`contains_pre_existing_features`	Determines whether or not features from a DFS Transformer match pipeline input features.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits the DFSTransformer component.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Computes the feature matrix for the input X using featuretools' dfs algorithm.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

static contains_pre_existing_features(dfs_features: Optional[List[featuretools.feature_base.FeatureBase]], input_feature_names: List[str], target: Optional[str] = None)[source]#

Determines whether or not features from a DFS Transformer match pipeline input features.

Parameters

dfs_features (Optional[List[FeatureBase]]) – List of features output from a DFS Transformer.
input_feature_names (List[str]) – List of input features into the DFS Transformer.
target (Optional[str]) – The target whose values we are trying to predict. This is used to know which column to ignore if the target column is present in the list of features in the DFS Transformer’s parameters.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the DFSTransformer component.

Parameters

X (pd.DataFrame, np.array) – The input data to transform, of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Computes the feature matrix for the input X using featuretools’ dfs algorithm.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data to transform. Has shape [n_samples, n_features]
y (pd.Series, optional) – Ignored.

Returns

Feature matrix

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.DropNaNRowsTransformer(parameters=None, component_obj=None, random_seed=0, **kwargs)[source]#

Transformer to drop rows with NaN values.

Parameters: random_seed (int) – Seed for the random number generator. Is not used by this component. Defaults to 0.

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	True
name	Drop NaN Rows Transformer
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits component to data.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms data using fitted component.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms data using fitted component.

Parameters

X (pd.DataFrame) – Features.
y (pd.Series, optional) – Target data.

Returns

Data with NaN rows dropped.

Return type

(pd.DataFrame, pd.Series)
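
The row-dropping behaviour can be sketched without pandas. The sketch makes two assumptions: data is held in plain Python lists, and a row is dropped when either its features or its target value contain NaN. The second assumption is inferred from the component modifying both features and target. `drop_nan_rows` is a hypothetical helper.

```python
import math


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def drop_nan_rows(X, y=None):
    """Drop rows where any feature, or the matching target value, is NaN."""
    keep = []
    for i, row in enumerate(X):
        row_has_nan = any(_is_nan(v) for v in row)
        target_has_nan = y is not None and _is_nan(y[i])
        if not (row_has_nan or target_has_nan):
            keep.append(i)
    X_kept = [X[i] for i in keep]
    y_kept = None if y is None else [y[i] for i in keep]
    return X_kept, y_kept


X = [[1.0, 2.0], [float("nan"), 3.0], [4.0, 5.0]]
y = [0.0, 1.0, float("nan")]

X_t, y_t = drop_nan_rows(X, y)
X_only, y_none = drop_nan_rows(X)
```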

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.ElasticNetClassifier(penalty='elasticnet', C=1.0, l1_ratio=0.15, multi_class='auto', solver='saga', n_jobs=-1, random_seed=0, **kwargs)[source]#

Elastic Net Classifier. Uses Logistic Regression with elasticnet penalty as the base estimator.

Parameters

penalty ({"l1", "l2", "elasticnet", "none"}) – The norm used in penalization. Defaults to “elasticnet”.
C (float) – Inverse of regularization strength. Must be a positive float. Defaults to 1.0.
l1_ratio (float) – The mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty=‘elasticnet’. Setting l1_ratio=0 is equivalent to using penalty=‘l2’, while setting l1_ratio=1 is equivalent to using penalty=‘l1’. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. Defaults to 0.15.
multi_class ({"auto", "ovr", "multinomial"}) – If the option chosen is “ovr”, then a binary problem is fit for each label. For “multinomial” the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. “multinomial” is unavailable when solver=“liblinear”. “auto” selects “ovr” if the data is binary, or if solver=“liblinear”, and otherwise selects “multinomial”. Defaults to “auto”.
solver ({"newton-cg", "lbfgs", "liblinear", "sag", "saga"}) –
Algorithm to use in the optimization problem. For small datasets, “liblinear” is a good choice, whereas “sag” and “saga” are faster for large ones. For multiclass problems, only “newton-cg”, “sag”, “saga” and “lbfgs” handle multinomial loss; “liblinear” is limited to one-versus-rest schemes.
- “newton-cg”, “lbfgs”, “sag” and “saga” handle L2 or no penalty
- “liblinear” and “saga” also handle L1 penalty
- “saga” also supports “elasticnet” penalty
- “liblinear” does not support setting penalty=‘none’
Defaults to “saga”.
n_jobs (int) – Number of parallel threads used to run the estimator. Note that creating thread contention will significantly slow down the algorithm. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
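
The solver and penalty combinations listed above can be collected into a lookup table. This is a sketch built only from that list. `SUPPORTED_PENALTIES` and `is_supported` are hypothetical names, not part of evalml or scikit-learn.

```python
# Penalties each solver accepts, transcribed from the solver notes above.
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "sag": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}


def is_supported(solver, penalty):
    """Return True if the solver is documented as handling the penalty."""
    if solver not in SUPPORTED_PENALTIES:
        raise ValueError(f"Unknown solver: {solver!r}")
    return penalty in SUPPORTED_PENALTIES[solver]
```

The table shows why the default solver is “saga”: it is the only one listed as supporting the default “elasticnet” penalty.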

Attributes

hyperparameter_ranges	{ “C”: Real(0.01, 10), “l1_ratio”: Real(0, 1)}
model_family	ModelFamily.LINEAR_MODEL
modifies_features	True
modifies_target	False
name	Elastic Net Classifier
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance for fitted ElasticNet classifier.
`fit`	Fits ElasticNet classifier component to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Feature importance for fitted ElasticNet classifier.

fit(self, X, y)[source]#

Fits ElasticNet classifier component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of coverage levels, each a float between 0 and 1, for which the upper and lower bounds of the prediction interval should be calculated.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.ElasticNetRegressor(alpha=0.0001, l1_ratio=0.15, max_iter=1000, random_seed=0, **kwargs)[source]#

Elastic Net Regressor.

Parameters

alpha (float) – Constant that multiplies the penalty terms. Defaults to 0.0001.
l1_ratio (float) – The mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio=0 the penalty is an L2 penalty, and for l1_ratio=1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. Defaults to 0.15.
max_iter (int) – The maximum number of iterations. Defaults to 1000.
random_seed (int) – Seed for the random number generator. Defaults to 0.
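
How alpha and l1_ratio combine can be sketched as a small function. The sketch assumes the scikit-learn `ElasticNet` parameterization of the penalty, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper name `elastic_net_penalty` is hypothetical.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term added to the squared-error loss for given coefficients."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1.0 - l1_ratio) * l2_norm_squared


weights = [1.0, -2.0]
pure_l1 = elastic_net_penalty(weights, alpha=2.0, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(weights, alpha=2.0, l1_ratio=0.0)
mixed = elastic_net_penalty(weights, alpha=2.0, l1_ratio=0.5)
```

With l1_ratio=1 only the absolute values of the coefficients are penalized, with l1_ratio=0 only their squares, and values in between blend the two.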

Attributes

hyperparameter_ranges	{ “alpha”: Real(0, 1), “l1_ratio”: Real(0, 1),}
model_family	ModelFamily.LINEAR_MODEL
modifies_features	True
modifies_target	False
name	Elastic Net Regressor
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION, ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance for fitted ElasticNet regressor.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Feature importance for fitted ElasticNet regressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of coverage levels, each a float between 0 and 1, for which the upper and lower bounds of the prediction interval should be calculated.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.Estimator(parameters: dict = None, component_obj: Type[evalml.pipelines.components.ComponentBase] = None, random_seed: Union[int, float] = 0, **kwargs)[source]#

A component that fits and predicts given data.

To implement a new Estimator, define your own class which is a subclass of Estimator, including a name and a list of acceptable ranges for any parameters to be tuned during the automl search (hyperparameters). Define an __init__ method which sets up any necessary state and objects. Make sure your __init__ only uses standard keyword arguments and calls super().__init__() with a parameters dict. You may also override the fit, predict, predict_proba and other methods in this class if appropriate.

To see some examples, check out the definitions of any Estimator component subclass.
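
The subclassing steps above might look like the following minimal sketch. To keep it runnable without EvalML installed, `SketchEstimatorBase` is a hypothetical stand-in for `Estimator`, and `MeanRegressor`, its `name`, and its `hyperparameter_ranges` are invented for illustration. A real component would subclass `evalml.pipelines.Estimator` instead.

```python
class SketchEstimatorBase:
    """Hypothetical stand-in for evalml.pipelines.Estimator (not the real class)."""

    def __init__(self, parameters=None, component_obj=None, random_seed=0, **kwargs):
        self.parameters = dict(parameters or {})
        self.parameters.update(kwargs)
        self._component_obj = component_obj
        self.random_seed = random_seed
        self._is_fitted = False


class MeanRegressor(SketchEstimatorBase):
    """Toy estimator that predicts a (optionally shrunk) mean of the training target."""

    # A name and tunable ranges, as the description above asks for.
    name = "Mean Regressor"
    hyperparameter_ranges = {"shrinkage": (0.0, 1.0)}

    def __init__(self, shrinkage=0.0, random_seed=0, **kwargs):
        # Only standard keyword arguments, collected into a parameters dict.
        parameters = {"shrinkage": shrinkage}
        parameters.update(kwargs)
        super().__init__(parameters=parameters, component_obj=None, random_seed=random_seed)

    def fit(self, X, y=None):
        if y is None:
            raise ValueError("MeanRegressor requires y.")
        mean = sum(y) / len(y)
        self._prediction = (1.0 - self.parameters["shrinkage"]) * mean
        self._is_fitted = True
        return self

    def predict(self, X):
        # One prediction per input row.
        return [self._prediction for _ in X]
```

Calling `MeanRegressor().fit(X, y).predict(X_new)` then returns one value per row of `X_new`.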

Parameters

parameters (dict) – Dictionary of parameters for the component. Defaults to None.
component_obj (obj) – Third-party objects useful in component implementation. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

model_family	ModelFamily.NONE
modifies_features	True
modifies_target	False
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`model_family`	ModelFamily.NONE
`name`	Returns string name of this component.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`supported_problem_types`	Problem types this estimator supports.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict
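
The convention above can be checked mechanically. The sketch below uses a hypothetical `ToyComponent` (not an EvalML class) and models `default_parameters` as a classmethod rather than a class property, purely to stay dependency-free. The helper `init_defaults` is likewise an illustrative assumption.

```python
import inspect


class ToyComponent:
    """Hypothetical component used only to illustrate the convention."""

    def __init__(self, n_estimators=100, max_depth=6, random_seed=0):
        self.parameters = {"n_estimators": n_estimators, "max_depth": max_depth}
        self.random_seed = random_seed

    @classmethod
    def default_parameters(cls):
        # Build a default instance and report its parameters.
        return cls().parameters


def init_defaults(cls, exclude=("self", "random_seed")):
    """Collect the keyword defaults declared on cls.__init__."""
    signature = inspect.signature(cls.__init__)
    return {
        name: param.default
        for name, param in signature.parameters.items()
        if name not in exclude and param.default is not inspect.Parameter.empty
    }
```

Under this sketch, the defaults read from the `__init__` signature, the class-level defaults, and the parameters of a freshly constructed instance all agree.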

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)[source]#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series][source]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
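
To make the returned dictionary concrete, here is a minimal sketch of the `{coverage}_lower` / `{coverage}_upper` key format. The symmetric normal interval, the `residual_std` argument, and the default coverage of 0.95 are assumptions made for illustration, not a description of how EvalML computes its intervals.

```python
from statistics import NormalDist


def normal_prediction_intervals(predictions, residual_std, coverage=(0.95,)):
    """Return symmetric normal intervals keyed as '{coverage}_lower' / '{coverage}_upper'."""
    intervals = {}
    for level in coverage:
        if not 0.0 < level < 1.0:
            raise ValueError("coverage values must lie strictly between 0 and 1")
        # Two-sided z-score: 0.95 coverage leaves 2.5% in each tail.
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        margin = z * residual_std
        intervals[f"{level}_lower"] = [p - margin for p in predictions]
        intervals[f"{level}_upper"] = [p + margin for p in predictions]
    return intervals
```

Passing `coverage=[0.8, 0.95]` yields four keys: `0.8_lower`, `0.8_upper`, `0.95_lower`, and `0.95_upper`.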

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

property model_family(cls)#: Returns ModelFamily of this component.

property name(cls)#: Returns string name of this component.

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series[source]#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series[source]#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

property supported_problem_types(cls)#: Problem types this estimator supports.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
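
The effect of `reset_fit` can be pictured with a small sketch. `TunableComponent` is a hypothetical class written for this example. Only the `update_parameters` signature and the `_is_fitted` flag come from the description above.

```python
class TunableComponent:
    """Hypothetical component that tracks parameters and a fitted flag."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)
        self._is_fitted = False

    @property
    def parameters(self):
        # Return a copy so callers cannot mutate internal state by accident.
        return dict(self._parameters)

    def fit(self):
        self._is_fitted = True
        return self

    def update_parameters(self, update_dict, reset_fit=True):
        self._parameters.update(update_dict)
        if reset_fit:
            # Changed parameters invalidate whatever was learned during fit.
            self._is_fitted = False
```

Resetting by default is the cautious choice: a component fitted with the old parameters would otherwise predict from a stale model.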

class evalml.pipelines.ExponentialSmoothingRegressor(trend: Optional[str] = None, damped_trend: bool = False, seasonal: Optional[str] = None, sp: int = 2, n_jobs: int = -1, random_seed: Union[int, float] = 0, **kwargs)[source]#

Holt-Winters Exponential Smoothing Forecaster.

Currently ExponentialSmoothingRegressor isn’t supported via conda install. It’s recommended that it be installed via PyPI.

Parameters

trend (str) – Type of trend component. Defaults to None.
damped_trend (bool) – If the trend component should be damped. Defaults to False.
seasonal (str) – Type of seasonal component. Takes one of {“additive”, None}. Can also be multiplicative if none of the target data is 0, but AutoMLSearch will not tune for this. Defaults to None.
sp (int) – The number of seasonal periods to consider. Defaults to 2.
n_jobs (int or None) – Non-negative integer describing level of parallelism used for pipelines. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
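
To illustrate what `trend` and `damped_trend` control, the following is a minimal pure-Python sketch of additive-trend (Holt) exponential smoothing with optional damping. It is not the implementation this component wraps: the function name, the smoothing weights `alpha` and `beta`, the damping factor `phi`, and the initialisation from the first two observations are all assumptions, and the seasonal terms governed by `seasonal` and `sp` are omitted.

```python
def holt_forecast(y, horizon, alpha=0.5, beta=0.5, damped_trend=False, phi=0.9):
    """Forecast `horizon` steps ahead with additive-trend exponential smoothing."""
    if len(y) < 2:
        raise ValueError("need at least two observations to initialise the trend")
    damping = phi if damped_trend else 1.0
    level, trend = y[0], y[1] - y[0]
    for value in y[1:]:
        previous_level = level
        # Blend the new observation with the (damped) one-step-ahead forecast.
        level = alpha * value + (1 - alpha) * (previous_level + damping * trend)
        trend = beta * (level - previous_level) + (1 - beta) * damping * trend
    forecasts, cumulative_damping = [], 0.0
    for step in range(1, horizon + 1):
        # With damping < 1 each further step adds a smaller share of the trend.
        cumulative_damping += damping ** step
        forecasts.append(level + cumulative_damping * trend)
    return forecasts
```

On a perfectly linear series the undamped forecast continues the line, while the damped forecast flattens as the horizon grows.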

Attributes

hyperparameter_ranges	{ “trend”: [None, “additive”], “damped_trend”: [True, False], “seasonal”: [None, “additive”], “sp”: Integer(2, 8),}
model_family	ModelFamily.EXPONENTIAL_SMOOTHING
modifies_features	True
modifies_target	False
name	Exponential Smoothing Regressor
supported_problem_types	[ProblemTypes.TIME_SERIES_REGRESSION]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns array of 0's with a length of 1 as feature_importance is not defined for Exponential Smoothing regressor.
`fit`	Fits Exponential Smoothing Regressor to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted ExponentialSmoothingRegressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using fitted Exponential Smoothing regressor.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#: Returns array of 0’s with a length of 1 as feature_importance is not defined for Exponential Smoothing regressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)[source]#

Fits Exponential Smoothing Regressor to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features]. Ignored.
y (pd.Series) – The target training data of length [n_samples].

Returns

self

Raises

ValueError – If y was not passed in.

get_prediction_intervals(self, X, y=None, coverage=None, predictions=None)#

Find the prediction intervals using the fitted ExponentialSmoothingRegressor.

Calculates the prediction intervals by using a simulation of the time series following a specified state space model.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Optional.
coverage (List[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Not used for Exponential Smoothing regressor.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict
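
The simulation idea can be sketched as follows: draw many future paths around the point forecasts, then read the interval bounds off the empirical quantiles at each step. The random-walk noise model, the `noise_std` argument, the path count, and the function name are assumptions chosen to keep the example short. The state space model actually simulated by the component is not specified here.

```python
import random


def simulated_intervals(point_forecasts, noise_std, coverage=(0.95,), n_paths=500, seed=0):
    """Estimate prediction intervals from simulated paths around point forecasts."""
    rng = random.Random(seed)
    horizon = len(point_forecasts)
    paths = []
    for _ in range(n_paths):
        shock, path = 0.0, []
        for forecast in point_forecasts:
            # Shocks accumulate, so uncertainty widens with the horizon.
            shock += rng.gauss(0.0, noise_std)
            path.append(forecast + shock)
        paths.append(path)
    intervals = {}
    for level in coverage:
        tail = (1.0 - level) / 2.0
        lower, upper = [], []
        for step in range(horizon):
            values = sorted(path[step] for path in paths)
            # Empirical quantiles of the simulated values at this step.
            lower.append(values[int(tail * (n_paths - 1))])
            upper.append(values[int((1.0 - tail) * (n_paths - 1))])
        intervals[f"{level}_lower"] = lower
        intervals[f"{level}_upper"] = upper
    return intervals
```

Because the bounds come from simulated quantiles rather than a closed-form formula, they need not be symmetric around the point forecast.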

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None) → pandas.Series[source]#

Make predictions using fitted Exponential Smoothing regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features]. Ignored except to set forecast horizon.
y (pd.Series) – Target data.

Returns

Predicted values.

Return type

pd.Series

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.ExtraTreesClassifier(n_estimators=100, max_features='sqrt', max_depth=6, min_samples_split=2, min_weight_fraction_leaf=0.0, n_jobs=-1, random_seed=0, **kwargs)[source]#

Extra Trees Classifier.

Parameters

n_estimators (int) – The number of trees in the forest. Defaults to 100.
max_features (int, float or {"sqrt", "log2"}) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features = n_features.
The search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
max_depth (int) – The maximum depth of the tree. Defaults to 6.
min_samples_split (int or float) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Defaults to 2.
min_weight_fraction_leaf (float) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Defaults to 0.0.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
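
The rules for `max_features` and `min_samples_split` reduce to a little arithmetic, sketched below. The helper names are hypothetical, and clamping the computed feature count to at least 1 is an assumption added so that small inputs still yield a usable value. The formulas themselves follow the parameter descriptions above.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of features per split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        count = int(math.sqrt(n_features))
    elif max_features == "log2":
        count = int(math.log2(n_features))
    elif isinstance(max_features, float):
        # A float is read as a fraction of the available features.
        count = int(max_features * n_features)
    elif isinstance(max_features, int):
        return max_features
    else:
        raise ValueError(f"unsupported max_features: {max_features!r}")
    return max(1, count)  # assumption: never consider fewer than one feature


def resolve_min_samples_split(min_samples_split, n_samples):
    """Translate a min_samples_split setting into a sample count."""
    if isinstance(min_samples_split, float):
        # A float is read as a fraction of the training samples, rounded up.
        return math.ceil(min_samples_split * n_samples)
    return min_samples_split
```

For example, with 16 features both “sqrt” and “log2” resolve to 4 features per split.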

Attributes

hyperparameter_ranges	{ “n_estimators”: Integer(10, 1000), “max_features”: [“sqrt”, “log2”], “max_depth”: Integer(4, 10),}
model_family	ModelFamily.EXTRA_TREES
modifies_features	True
modifies_target	False
name	Extra Trees Classifier
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.ExtraTreesRegressor(n_estimators: int = 100, max_features: str = 'sqrt', max_depth: int = 6, min_samples_split: int = 2, min_weight_fraction_leaf: float = 0.0, n_jobs: int = -1, random_seed: Union[int, float] = 0, **kwargs)[source]#

Extra Trees Regressor.

Parameters

n_estimators (int) – The number of trees in the forest. Defaults to 100.
max_features (int, float or {"sqrt", "log2"}) –
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features = n_features.
The search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
max_depth (int) – The maximum depth of the tree. Defaults to 6.
min_samples_split (int or float) –
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Defaults to 2.
min_weight_fraction_leaf (float) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Defaults to 0.0.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “n_estimators”: Integer(10, 1000), “max_features”: [“sqrt”, “log2”], “max_depth”: Integer(4, 10),}
model_family	ModelFamily.EXTRA_TREES
modifies_features	True
modifies_target	False
name	Extra Trees Regressor
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION, ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted ExtraTreesRegressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X, y=None, coverage=None, predictions=None)#

Find the prediction intervals using the fitted ExtraTreesRegressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Optional.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict
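
One plausible way to derive such intervals from a tree ensemble is to treat the disagreement between individual trees as a measure of uncertainty. The sketch below does that with a normal approximation. It is an illustrative assumption rather than a statement of how this component computes its intervals, and `per_tree_predictions` is a hypothetical input holding one list of predictions per tree.

```python
from statistics import NormalDist, mean, pstdev


def ensemble_prediction_intervals(per_tree_predictions, coverage=(0.95,)):
    """Build intervals from the spread of per-tree predictions for each sample."""
    # Regroup so that each tuple holds every tree's prediction for one sample.
    columns = list(zip(*per_tree_predictions))
    centers = [mean(column) for column in columns]
    spreads = [pstdev(column) for column in columns]
    intervals = {}
    for level in coverage:
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        intervals[f"{level}_lower"] = [c - z * s for c, s in zip(centers, spreads)]
        intervals[f"{level}_upper"] = [c + z * s for c, s in zip(centers, spreads)]
    return intervals
```

Samples on which all trees agree get a zero-width interval under this scheme, which is one reason to treat it as a rough heuristic.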

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.FeatureSelector(parameters=None, component_obj=None, random_seed=0, **kwargs)[source]#

Selects top features based on importance weights.

Parameters

parameters (dict) – Dictionary of parameters for the component. Defaults to None.
component_obj (obj) – Third-party objects useful in component implementation. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

modifies_features	True
modifies_target	False
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits component to data.
`fit_transform`	Fit and transform data using the feature selector.
`get_names`	Get names of selected features.
`load`	Loads component at file path.
`name`	Returns string name of this component.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms input data by selecting features. If the component_obj does not have a transform method, will raise a MethodPropertyNotFoundError exception.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)#

Fits component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features]
y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X, y=None)[source]#

Fit and transform data using the feature selector.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Transformed data.

Return type

pd.DataFrame

get_names(self)[source]#

Get names of selected features.

Returns: List of the names of features selected.
Return type: list[str]
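
Selecting top features by importance weight can be sketched in a few lines. The function name and the `n_features` and `threshold` arguments are hypothetical, since concrete selectors define their own parameters. The example only shows the ranking-and-filtering idea behind `get_names`.

```python
def select_top_features(importances, n_features=None, threshold=None):
    """Return feature names ordered by descending importance weight."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        # Keep only features whose weight reaches the threshold.
        ranked = [(name, weight) for name, weight in ranked if weight >= threshold]
    if n_features is not None:
        ranked = ranked[:n_features]
    return [name for name, _ in ranked]
```

A subsequent transform step would then keep only the columns whose names appear in the returned list.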

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

property name(cls)#: Returns string name of this component.

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms input data by selecting features. If the component_obj does not have a transform method, will raise a MethodPropertyNotFoundError exception.

Parameters

X (pd.DataFrame) – Data to transform.
y (pd.Series, optional) – Target data. Ignored.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If feature selector does not have a transform method or a component_obj that implements transform

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.Imputer(categorical_impute_strategy='most_frequent', categorical_fill_value=None, numeric_impute_strategy='mean', numeric_fill_value=None, boolean_impute_strategy='most_frequent', boolean_fill_value=None, random_seed=0, **kwargs)[source]#

Imputes missing data according to a specified imputation strategy.

Parameters

categorical_impute_strategy (string) – Impute strategy to use for string, object, boolean, categorical dtypes. Valid values include “most_frequent” and “constant”.
numeric_impute_strategy (string) – Impute strategy to use for numeric columns. Valid values include “mean”, “median”, “most_frequent”, and “constant”.
boolean_impute_strategy (string) – Impute strategy to use for boolean columns. Valid values include “most_frequent” and “constant”.
categorical_fill_value (string) – When categorical_impute_strategy == “constant”, fill_value is used to replace missing data. The default value of None will fill with the string “missing_value”.
numeric_fill_value (int, float) – When numeric_impute_strategy == “constant”, fill_value is used to replace missing data. The default value of None will fill with 0.
boolean_fill_value (bool) – When boolean_impute_strategy == “constant”, fill_value is used to replace missing data. The default value of None will fill with True.
random_seed (int) – Seed for the random number generator. Defaults to 0.
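
The strategies listed above behave roughly as in this single-column sketch. It is a simplified illustration, not the component's implementation: `impute_column` is a hypothetical helper, it treats both `None` and NaN as missing, and it learns and applies the replacement value in one call instead of separating fit from transform.

```python
import math
from statistics import mean, median, mode


def impute_column(values, strategy="mean", fill_value=None):
    """Replace missing entries in one column according to the chosen strategy."""

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    observed = [value for value in values if not is_missing(value)]
    if strategy == "constant":
        replacement = fill_value
    elif strategy == "mean":
        replacement = mean(observed)
    elif strategy == "median":
        replacement = median(observed)
    elif strategy == "most_frequent":
        replacement = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Observed values pass through unchanged; only gaps are filled.
    return [replacement if is_missing(value) else value for value in values]
```

Here “mean” and “median” only make sense for numeric columns, whereas “most_frequent” and “constant” also work for categorical and boolean data, which matches the valid values listed per column type.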

Attributes

hyperparameter_ranges	{ “categorical_impute_strategy”: [“most_frequent”], “numeric_impute_strategy”: [“mean”, “median”, “most_frequent”, “knn”], “boolean_impute_strategy”: [“most_frequent”]}
modifies_features	True
modifies_target	False
name	Imputer
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits imputer to data. 'None' values are converted to np.nan before imputation and are treated as the same.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms data X by imputing missing values.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict
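
The convention above can be sketched with a toy class. `ToyComponent` is a hypothetical stand-in, not an EvalML class, and it exposes `default_parameters` as a plain classmethod (called with parentheses) purely to keep the example short.

```python
import inspect


class ToyComponent:
    """Hypothetical stand-in for a component; not an EvalML class."""

    def __init__(self, n_neighbors=5, weights="uniform"):
        self._parameters = {"n_neighbors": n_neighbors, "weights": weights}

    @property
    def parameters(self):
        # Return a copy so callers cannot mutate internal state.
        return dict(self._parameters)

    @classmethod
    def default_parameters(cls):
        # Read the defaults straight from the constructor signature.
        signature = inspect.signature(cls.__init__)
        return {
            name: param.default
            for name, param in signature.parameters.items()
            if param.default is not inspect.Parameter.empty
        }


# The convention: defaults equal the parameters of a default-constructed instance.
matches_convention = ToyComponent.default_parameters() == ToyComponent().parameters
```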

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits imputer to data. ‘None’ values are converted to np.nan before imputation and are treated as the same.

Parameters

X (pd.DataFrame, np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.
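
A minimal sketch of the save/load round trip follows. The signature above defaults to cloudpickle's protocol; this sketch substitutes the standard-library `pickle` module so that it runs without extra dependencies, and it serializes a plain dictionary rather than a real component. The helper names `save_object` and `load_object` are hypothetical.

```python
import os
import pickle
import tempfile


def save_object(obj, file_path, pickle_protocol=pickle.DEFAULT_PROTOCOL):
    """Write obj to file_path using the given pickle protocol."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle_protocol)


def load_object(file_path):
    """Read back whatever was stored at file_path."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Stand-in for a component's state; a real component would be pickled whole.
state = {"name": "Imputer", "parameters": {"numeric_impute_strategy": "mean"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "component.pkl")
    save_object(state, path)
    restored = load_object(path)
```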

transform(self, X, y=None)[source]#

Transforms data X by imputing missing values.

Parameters

X (pd.DataFrame) – Data to transform
y (pd.Series, optional) – Ignored.

Returns

Transformed X

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
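
The effect of reset_fit can be sketched with a toy class. `ToyTransformer` is hypothetical and only imitates the documented behavior of clearing the fitted flag after a parameter update.

```python
class ToyTransformer:
    """Hypothetical stand-in for a component; not an EvalML class."""

    def __init__(self, **parameters):
        self.parameters = dict(parameters)
        self._is_fitted = False

    def fit(self):
        # A real component would learn from data here.
        self._is_fitted = True
        return self

    def update_parameters(self, update_dict, reset_fit=True):
        self.parameters.update(update_dict)
        if reset_fit:
            # Changed parameters invalidate whatever was learned during fit.
            self._is_fitted = False


component = ToyTransformer(numeric_impute_strategy="mean").fit()
component.update_parameters({"numeric_impute_strategy": "median"})
fitted_after_update = component._is_fitted
```

Passing reset_fit=False keeps the fitted flag as it was, which is only safe when the updated parameter does not affect what was learned.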

class evalml.pipelines.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, random_seed=0, **kwargs)[source]#

K-Nearest Neighbors Classifier.

Parameters

n_neighbors (int) – Number of neighbors to use by default. Defaults to 5.
weights ({‘uniform’, ‘distance’} or callable) –
Weight function used in prediction. Can be:
- ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
- ‘distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
- [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
Defaults to “uniform”.
algorithm ({‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}) –
Algorithm used to compute the nearest neighbors:
- ‘ball_tree’ will use BallTree
- ‘kd_tree’ will use KDTree
- ‘brute’ will use a brute-force search.
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit method. Defaults to “auto”. Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size (int) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. Defaults to 30.
p (int) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Defaults to 2.
random_seed (int) – Seed for the random number generator. Defaults to 0.
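
To make the roles of n_neighbors, weights, and p concrete, here is a brute-force sketch in plain Python. It is not how the component is implemented, and the function names are hypothetical. Treating an exact match as infinitely weighted is a simplification chosen for this sketch.

```python
from collections import defaultdict


def minkowski_distance(a, b, p=2):
    """p=1 gives the Manhattan (l1) distance, p=2 the Euclidean (l2) distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, n_neighbors=5, weights="uniform", p=2):
    """Predict one label by voting among the n_neighbors closest training rows."""
    neighbors = sorted(
        ((minkowski_distance(row, query, p), label) for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:n_neighbors]
    votes = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0
        elif weights == "distance":
            # Closer neighbors count more; an exact match dominates the vote.
            votes[label] += 1.0 / distance if distance > 0 else float("inf")
        else:
            raise ValueError(f"Unsupported weights: {weights}")
    return max(votes, key=votes.get)


train_X = [[0, 0], [1, 0], [5, 5], [6, 5], [5, 6]]
train_y = ["a", "a", "b", "b", "b"]
uniform_label = knn_predict(train_X, train_y, [1, 1], weights="uniform")
distance_label = knn_predict(train_X, train_y, [1, 1], weights="distance")
```

With all five points voting, uniform weights favor the majority class, while distance weights favor the two nearby points, so the two settings disagree on this query.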

Attributes

hyperparameter_ranges	{ “n_neighbors”: Integer(2, 12), “weights”: [“uniform”, “distance”], “algorithm”: [“auto”, “ball_tree”, “kd_tree”, “brute”], “leaf_size”: Integer(10, 30), “p”: Integer(1, 5),}
model_family	ModelFamily.K_NEIGHBORS
modifies_features	True
modifies_target	False
name	KNN Classifier
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns array of 0's matching the input number of features as feature_importance is not defined for KNN classifiers.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Returns array of 0’s matching the input number of features as feature_importance is not defined for KNN classifiers.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series[source]#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series

predict_proba(self, X: pandas.DataFrame) → pandas.Series[source]#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.LightGBMClassifier(boosting_type='gbdt', learning_rate=0.1, n_estimators=100, max_depth=0, num_leaves=31, min_child_samples=20, bagging_fraction=0.9, bagging_freq=0, n_jobs=-1, random_seed=0, **kwargs)[source]#

LightGBM Classifier.

Parameters

boosting_type (string) –
Type of boosting to use:
- “gbdt” uses traditional Gradient Boosting Decision Tree
- “dart” uses Dropouts meet Multiple Additive Regression Trees
- “goss” uses Gradient-based One-Side Sampling
- “rf” uses Random Forest
Defaults to “gbdt”.
learning_rate (float) – Boosting learning rate. Defaults to 0.1.
n_estimators (int) – Number of boosted trees to fit. Defaults to 100.
max_depth (int) – Maximum tree depth for base learners, <=0 means no limit. Defaults to 0.
num_leaves (int) – Maximum tree leaves for base learners. Defaults to 31.
min_child_samples (int) – Minimum number of data points needed in a child (leaf). Defaults to 20.
bagging_fraction (float) – LightGBM will randomly select a subset of the data (rows) without resampling if this is smaller than 1.0. For example, if set to 0.8, LightGBM will select 80% of the rows for each bagging round. This can be used to speed up training and deal with overfitting. Bagging only takes effect when bagging_freq is greater than 0. Defaults to 0.9.
bagging_freq (int) – Frequency for bagging. 0 means bagging is disabled. k means bagging is performed every k iterations: at every k-th iteration, LightGBM will randomly select bagging_fraction * 100 % of the data to use for the next k iterations. Defaults to 0.
n_jobs (int or None) – Number of threads to run in parallel. -1 uses all threads. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
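
The interaction between bagging_fraction and bagging_freq can be sketched as a row-sampling schedule. This is an illustration of the description above rather than LightGBM's actual sampling code; the function name `bagging_schedule` and the use of Python's `random` module are assumptions made for the sketch.

```python
import random


def bagging_schedule(n_rows, n_iterations, bagging_fraction=0.9, bagging_freq=0, seed=0):
    """Return, for each boosting iteration, the row indices that would be used."""
    rng = random.Random(seed)
    all_rows = list(range(n_rows))
    current = all_rows
    schedule = []
    for iteration in range(n_iterations):
        resample = (
            bagging_freq > 0
            and bagging_fraction < 1.0
            and iteration % bagging_freq == 0
        )
        if resample:
            # Draw rows without replacement and keep them for the next k iterations.
            sample_size = max(1, int(n_rows * bagging_fraction))
            current = sorted(rng.sample(all_rows, sample_size))
        schedule.append(current)
    return schedule


disabled = bagging_schedule(n_rows=10, n_iterations=4, bagging_fraction=0.5, bagging_freq=0)
every_two = bagging_schedule(n_rows=10, n_iterations=4, bagging_fraction=0.5, bagging_freq=2)
```

With bagging_freq=0 every iteration sees all rows regardless of bagging_fraction, which is why the default pair (0.9, 0) performs no bagging.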

Attributes

hyperparameter_ranges	{ “learning_rate”: Real(0.000001, 1), “boosting_type”: [“gbdt”, “dart”, “goss”, “rf”], “n_estimators”: Integer(10, 100), “max_depth”: Integer(0, 10), “num_leaves”: Integer(2, 100), “min_child_samples”: Integer(1, 100), “bagging_fraction”: Real(0.000001, 1), “bagging_freq”: Integer(0, 1),}
model_family	ModelFamily.LIGHTGBM
modifies_features	True
modifies_target	False
name	LightGBM Classifier
SEED_MAX	SEED_BOUNDS.max_bound
SEED_MIN	0
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits LightGBM classifier component to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using the fitted LightGBM classifier.
`predict_proba`	Make prediction probabilities using the fitted LightGBM classifier.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X, y=None)[source]#

Fits LightGBM classifier component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X)[source]#

Make predictions using the fitted LightGBM classifier.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.DataFrame

predict_proba(self, X)[source]#

Make prediction probabilities using the fitted LightGBM classifier.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted probability values.
Return type: pd.DataFrame

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.LightGBMRegressor(boosting_type='gbdt', learning_rate=0.1, n_estimators=20, max_depth=0, num_leaves=31, min_child_samples=20, bagging_fraction=0.9, bagging_freq=0, n_jobs=-1, random_seed=0, **kwargs)[source]#

LightGBM Regressor.

Parameters

boosting_type (string) –
Type of boosting to use:
- “gbdt” uses traditional Gradient Boosting Decision Tree
- “dart” uses Dropouts meet Multiple Additive Regression Trees
- “goss” uses Gradient-based One-Side Sampling
- “rf” uses Random Forest
Defaults to “gbdt”.
learning_rate (float) – Boosting learning rate. Defaults to 0.1.
n_estimators (int) – Number of boosted trees to fit. Defaults to 20.
max_depth (int) – Maximum tree depth for base learners, <=0 means no limit. Defaults to 0.
num_leaves (int) – Maximum tree leaves for base learners. Defaults to 31.
min_child_samples (int) – Minimum number of data points needed in a child (leaf). Defaults to 20.
bagging_fraction (float) – LightGBM will randomly select a subset of the data (rows) without resampling if this is smaller than 1.0. For example, if set to 0.8, LightGBM will select 80% of the rows for each bagging round. This can be used to speed up training and deal with overfitting. Bagging only takes effect when bagging_freq is greater than 0. Defaults to 0.9.
bagging_freq (int) – Frequency for bagging. 0 means bagging is disabled. k means bagging is performed every k iterations: at every k-th iteration, LightGBM will randomly select bagging_fraction * 100 % of the data to use for the next k iterations. Defaults to 0.
n_jobs (int or None) – Number of threads to run in parallel. -1 uses all threads. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “learning_rate”: Real(0.000001, 1), “boosting_type”: [“gbdt”, “dart”, “goss”, “rf”], “n_estimators”: Integer(10, 100), “max_depth”: Integer(0, 10), “num_leaves”: Integer(2, 100), “min_child_samples”: Integer(1, 100), “bagging_fraction”: Real(0.000001, 1), “bagging_freq”: Integer(0, 1),}
model_family	ModelFamily.LIGHTGBM
modifies_features	True
modifies_target	False
name	LightGBM Regressor
SEED_MAX	SEED_BOUNDS.max_bound
SEED_MIN	0
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits LightGBM regressor to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using fitted LightGBM regressor.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X, y=None)[source]#

Fits LightGBM regressor to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
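
The shape of the returned dictionary can be sketched as follows. The normal approximation used here is an assumption made purely for illustration and is not claimed to be the method this component uses; `normal_prediction_intervals` and `residual_std` are hypothetical names. Only the key format, {coverage}_lower and {coverage}_upper, is taken from the description above.

```python
from statistics import NormalDist


def normal_prediction_intervals(predictions, residual_std, coverage=(0.95,)):
    """Build symmetric intervals around predictions under a normal assumption."""
    intervals = {}
    for level in coverage:
        # Two-sided quantile: 0.95 coverage uses the 97.5th percentile.
        z = NormalDist().inv_cdf(0.5 + level / 2)
        intervals[f"{level}_lower"] = [p - z * residual_std for p in predictions]
        intervals[f"{level}_upper"] = [p + z * residual_std for p in predictions]
    return intervals


intervals = normal_prediction_intervals([10.0, 12.0], residual_std=2.0, coverage=[0.8, 0.95])
```

Each coverage value contributes one lower and one upper entry, so a list of two coverage levels yields four keys.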

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X)[source]#

Make predictions using fitted LightGBM regressor.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.LinearRegressor(fit_intercept=True, n_jobs=-1, random_seed=0, **kwargs)[source]#

Linear Regressor.

Parameters

fit_intercept (boolean) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered). Defaults to True.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all threads. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
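
The effect of fit_intercept can be seen in a single-feature least-squares sketch. This is plain Python written for illustration, not the component's implementation, and `fit_line` is a hypothetical helper.

```python
def fit_line(x, y, fit_intercept=True):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(x)
    if fit_intercept:
        x_mean = sum(x) / n
        y_mean = sum(y) / n
        covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        variance = sum((xi - x_mean) ** 2 for xi in x)
        slope = covariance / variance
        intercept = y_mean - slope * x_mean
    else:
        # Without an intercept the line is forced through the origin.
        slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
        intercept = 0.0
    return slope, intercept


x = [0, 1, 2, 3]
y = [1, 3, 5, 7]  # generated from y = 2x + 1
with_intercept = fit_line(x, y, fit_intercept=True)
without_intercept = fit_line(x, y, fit_intercept=False)
```

Because the data here are not centered, dropping the intercept biases the slope upward, which is why fit_intercept=False is only appropriate for centered data.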

Attributes

hyperparameter_ranges	{ “fit_intercept”: [True, False],}
model_family	ModelFamily.LINEAR_MODEL
modifies_features	True
modifies_target	False
name	Linear Regressor
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION, ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance for fitted linear regressor.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Feature importance for fitted linear regressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.LogisticRegressionClassifier(penalty='l2', C=1.0, multi_class='auto', solver='lbfgs', n_jobs=-1, random_seed=0, **kwargs)[source]#

Logistic Regression Classifier.

Parameters

penalty ({"l1", "l2", "elasticnet", "none"}) – The norm used in penalization. Defaults to “l2”.
C (float) – Inverse of regularization strength. Must be a positive float. Defaults to 1.0.
multi_class ({"auto", "ovr", "multinomial"}) – If the option chosen is “ovr”, then a binary problem is fit for each label. For “multinomial”, the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. “multinomial” is unavailable when solver=“liblinear”. “auto” selects “ovr” if the data is binary, or if solver=“liblinear”, and otherwise selects “multinomial”. Defaults to “auto”.
solver ({"newton-cg", "lbfgs", "liblinear", "sag", "saga"}) –
Algorithm to use in the optimization problem. For small datasets, “liblinear” is a good choice, whereas “sag” and “saga” are faster for large ones. For multiclass problems, only “newton-cg”, “sag”, “saga” and “lbfgs” handle multinomial loss; “liblinear” is limited to one-versus-rest schemes.
- “newton-cg”, “lbfgs”, “sag” and “saga” handle L2 or no penalty
- “liblinear” and “saga” also handle L1 penalty
- “saga” also supports “elasticnet” penalty
- “liblinear” does not support setting penalty=‘none’
Defaults to “lbfgs”.
n_jobs (int) – Number of parallel threads used to fit the model. Note that creating thread contention will significantly slow down the algorithm. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
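
The solver and penalty rules listed above can be encoded as a small lookup table. The table and the helper `check_solver_penalty` are hypothetical and only restate the compatibility notes in this parameter list; the component itself may validate combinations differently.

```python
# Penalties each solver accepts, transcribed from the notes above.
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
    "liblinear": {"l1", "l2"},
}


def check_solver_penalty(solver, penalty):
    """Return True if the solver is documented to handle the penalty."""
    if solver not in SUPPORTED_PENALTIES:
        raise ValueError(f"Unknown solver: {solver}")
    return penalty in SUPPORTED_PENALTIES[solver]


default_ok = check_solver_penalty("lbfgs", "l2")
lbfgs_l1_ok = check_solver_penalty("lbfgs", "l1")
```

A check like this is useful before a hyperparameter search, since an unsupported pair would otherwise fail only at fit time.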

Attributes

hyperparameter_ranges	{ “penalty”: [“l2”], “C”: Real(0.01, 10),}
model_family	ModelFamily.LINEAR_MODEL
modifies_features	True
modifies_target	False
name	Logistic Regression Classifier
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance for fitted logistic regression classifier.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Feature importance for fitted logistic regression classifier.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.MulticlassClassificationPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline subclass for all multiclass classification pipelines.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”]
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
custom_name (str) – Custom name for the pipeline. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.
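
The duplicate-naming rule described for component_graph can be sketched in plain Python. The helper `name_components` is hypothetical and is not EvalML's implementation; it assumes the suffix is the zero-based position of the repeated component in the list, which is consistent with the documented example.

```python
def name_components(components):
    """Return unique names, suffixing repeats with their list position."""
    seen = set()
    names = []
    for index, name in enumerate(components):
        if name in seen:
            # Assumption: the suffix is the zero-based position in the list.
            names.append(f"{name}_{index}")
        else:
            names.append(name)
            seen.add(name)
    return names


graph = ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
unique_names = name_components(graph)
```

The modified names matter when building the parameters dictionary, since a repeated component's parameters would be keyed by its suffixed name.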

Example

>>> pipeline = MulticlassClassificationPipeline(component_graph=["Simple Imputer", "Logistic Regression Classifier"],
...                                             parameters={"Logistic Regression Classifier": {"penalty": "elasticnet",
...                                                                                            "solver": "liblinear"}},
...                                             custom_name="My Multiclass Pipeline")
...
>>> assert pipeline.custom_name == "My Multiclass Pipeline"
>>> assert pipeline.component_graph.component_dict.keys() == {'Simple Imputer', 'Logistic Regression Classifier'}

The pipeline parameters will be chosen from the default parameters for every component, unless specific parameters were passed in as they were above.

>>> assert pipeline.parameters == {
...     'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None},
...     'Logistic Regression Classifier': {'penalty': 'elasticnet',
...                                        'C': 1.0,
...                                        'n_jobs': -1,
...                                        'multi_class': 'auto',
...                                        'solver': 'liblinear'}}

Attributes

problem_type

ProblemTypes.MULTICLASS

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`classes_`	Gets the class names for the pipeline. Will return None before pipeline is fit.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

property classes_(self)#: Gets the class names for the pipeline. Will return None before pipeline is fit.

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)#

Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training labels of length [n_samples]

Returns

self

Raises

ValueError – If the number of unique classes in y is not appropriate for the type of pipeline.
TypeError – If the dtype is boolean but pd.NA exists in the series.
Exception – For all other exceptions.
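
The class ordering described for fit can be sketched without EvalML. The helper `encode_targets` is hypothetical; it only demonstrates the documented rule that classes are ordered by sorted(set(y)) and mapped to integers from 0 to n_classes-1.

```python
def encode_targets(y):
    """Map each label to its position in sorted(set(y))."""
    classes = sorted(set(y))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in y], classes


encoded, classes = encode_targets(["dog", "cat", "bird", "cat"])
```

Because the ordering comes from sorting, the integer assigned to a label does not depend on the order in which labels appear in y.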

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: Component to return
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – load filepath or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)#

Make predictions using selected features.

Note: we cast y as ints first to address boolean values that may be returned from calculating predictions which we would not be able to otherwise transform if we originally had integer targets.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Estimated labels.

Return type

pd.Series

predict_proba(self, X, X_train=None, y_train=None)#

Make probability estimates for labels.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features]
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored. Only used for time series.
y_train (pd.Series or None) – Training labels. Ignored. Only used for time series.

Returns

Probability estimates

Return type

pd.DataFrame

Raises

ValueError – If final component is not an estimator.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.
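
To show what the pickle_protocol argument controls, here is a rough sketch using the standard-library pickle module as a stand-in for cloudpickle, which is the library named in the signature. The object `toy_pipeline` is a hypothetical placeholder, not a real pipeline, and the in-memory buffer is used only to keep the example self-contained.

```python
import io
import pickle

# Hypothetical placeholder for a pipeline object.
toy_pipeline = {"name": "My Multiclass Pipeline", "random_seed": 0}

# Serialize with an explicit protocol, then read the bytes back.
buffer = io.BytesIO()
pickle.dump(toy_pipeline, buffer, protocol=pickle.DEFAULT_PROTOCOL)
buffer.seek(0)
restored = pickle.load(buffer)
```

A lower protocol number generally trades efficiency for compatibility with older Python versions.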

score(self, X, y, objectives, X_train=None, y_train=None)#

Evaluate model performance on objectives.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features]
y (pd.Series) – True labels of length [n_samples]
objectives (list) – List of objectives to score
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series or None) – Targets corresponding to X. Optional.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Only used for time series.
y_train (pd.Series or None) – Training labels. Only used for time series.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.MultiseriesRegressionPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline base class for multiseries time series regression problems.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components.
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary {} implies using all default values for component parameters. Pipeline-level parameters such as time_index, gap, and max_delay must be specified with the “pipeline” key. For example: Pipeline(parameters={“pipeline”: {“time_index”: “Date”, “max_delay”: 4, “gap”: 2}}).
custom_name (str) – Custom name for the pipeline. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.
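
The layout of the parameters dictionary can be sketched as follows. The "pipeline" entry uses the values from the description above; the component name "Example Estimator" and its `n_jobs` setting are hypothetical and appear only to show how component-level and pipeline-level entries sit side by side.

```python
parameters = {
    # Pipeline-level settings live under the "pipeline" key.
    "pipeline": {"time_index": "Date", "max_delay": 4, "gap": 2},
    # Hypothetical component entry, keyed by component name.
    "Example Estimator": {"n_jobs": -1},
}

pipeline_level = parameters["pipeline"]
component_level = {
    name: values for name, values in parameters.items() if name != "pipeline"
}
```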

Attributes

NO_PREDS_PI_ESTIMATORS	ProblemTypes.TIME_SERIES_REGRESSION
problem_type	ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`dates_needed_for_prediction`	Return dates needed to forecast the given date in the future.
`dates_needed_for_prediction_range`	Return dates needed to forecast the given date range in the future.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Fit a multiseries time series pipeline.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_forecast_period`	Generates all possible forecasting time points based on latest data point in X.
`get_forecast_predictions`	Generates all possible forecasting predictions based on last period of X.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Predict on future data where target is not known.
`predict_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on current and additional objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

dates_needed_for_prediction(self, date)#

Return dates needed to forecast the given date in the future.

Parameters: date (pd.Timestamp) – Date to forecast in the future.
Returns: Range of dates needed to forecast the given date.
Return type: dates_needed (tuple(pd.Timestamp))

dates_needed_for_prediction_range(self, start_date, end_date)#

Return dates needed to forecast the given date range in the future.

Parameters

start_date (pd.Timestamp) – Start date of range to forecast in the future.
end_date (pd.Timestamp) – End date of range to forecast in the future.

Returns

Range of dates needed to forecast the given date range.

Return type

dates_needed (tuple(pd.Timestamp))

Raises

ValueError – If start_date doesn’t come before end_date

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)[source]#

Fit a multiseries time series pipeline.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The training targets of length [n_samples*n_series].

Returns

self

Raises

ValueError – If the target is not numeric.

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: Component to return
Return type: Component

get_forecast_period(self, X)[source]#

Generates all possible forecasting time points based on latest data point in X.

For the multiseries case, each time stamp is duplicated for each unique value in X’s series_id column. Input data must be stacked in order to properly generate unique periods.

Parameters: X (pd.DataFrame, np.ndarray) – Stacked data the pipeline was trained on of shape [n_samples_train * n_series_ids, n_features].
Raises: ValueError – If pipeline is not trained.
Returns: Dataframe containing a column with datetime periods from gap to forecast_horizon + gap per unique series_id value.
Return type: pd.DataFrame
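
The shape of the returned periods can be sketched with the standard library. This is not EvalML's implementation: the helper `forecast_period` is hypothetical, a daily frequency is assumed, and the first forecast point is assumed to fall gap + 1 steps after the last observed timestamp. The sketch only shows that each timestamp is repeated once per series ID.

```python
from datetime import date, timedelta


def forecast_period(last_date, series_ids, gap, forecast_horizon,
                    step=timedelta(days=1)):
    """Return (timestamp, series_id) rows for every forecast point."""
    rows = []
    # Assumption: forecasts start gap + 1 steps after the last observation.
    for offset in range(gap + 1, gap + forecast_horizon + 1):
        timestamp = last_date + offset * step
        # Each timestamp is duplicated for every unique series ID.
        for series_id in series_ids:
            rows.append((timestamp, series_id))
    return rows


rows = forecast_period(date(2023, 1, 10), ["a", "b"], gap=1, forecast_horizon=3)
```

With two series and a forecast horizon of three, the sketch yields six rows, two per timestamp.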

get_forecast_predictions(self, X, y)#

Generates all possible forecasting predictions based on last period of X.

Parameters

X (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Predictions from gap periods out to forecast_horizon + gap periods.

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

get_prediction_intervals(self, X, y=None, X_train=None, y_train=None, coverage=None)#

Find the prediction intervals using the fitted regressor.

Certain estimators (Extra Trees Estimator, XGBoost Estimator, Prophet Estimator, ARIMA, and Exponential Smoothing estimator) utilize a different methodology to calculate prediction intervals. See the docs for these estimators to learn more.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data.
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
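
The key format of the returned dictionary can be illustrated with a generic sketch. The helper `normal_prediction_intervals` is hypothetical and uses a simple normal approximation around point predictions based on the spread of residuals; this is a swapped-in textbook technique chosen for illustration and is not claimed to be the method EvalML uses.

```python
from statistics import NormalDist, stdev


def normal_prediction_intervals(predictions, residuals, coverage):
    """Build {coverage}_lower / {coverage}_upper bounds per coverage level."""
    spread = stdev(residuals)
    intervals = {}
    for cov in coverage:
        # Two-sided z-score for the requested coverage, e.g. ~1.96 for 0.95.
        z = NormalDist().inv_cdf(0.5 + cov / 2)
        intervals[f"{cov}_lower"] = [p - z * spread for p in predictions]
        intervals[f"{cov}_upper"] = [p + z * spread for p in predictions]
    return intervals


intervals = normal_prediction_intervals(
    predictions=[10.0, 12.0],
    residuals=[-1.0, 0.5, 1.0, -0.5],
    coverage=[0.95],
)
```

Each coverage value contributes two keys, so requesting several coverage levels returns a lower and an upper series for every level.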

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – load filepath or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)#

Predict on future data where target is not known.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame or np.ndarray or None) – Training data.
y_train (pd.Series or None) – Training labels.

Raises

ValueError – If X_train and/or y_train are None or if final component is not an Estimator.

Returns

Predictions.

predict_in_sample(self, X, y, X_train, y_train, objective=None, calculating_residuals=False, include_series_id=False)[source]#

Predict on future data where the target is known, e.g. cross validation.

Parameters

X (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – Future target of shape [n_samples]
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features]
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train]
objective (ObjectiveBase, str, None) – Objective used to threshold predicted probabilities, optional.
calculating_residuals (bool) – Whether we’re calling predict_in_sample to calculate the residuals. This means the X and y arguments are not future data, but actually the train data.
include_series_id (bool) – If true, include the series ID value in the prediction results

Returns

Estimated labels.

Return type

pd.Series

Raises

ValueError – If final component is not an Estimator.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)#

Evaluate model performance on current and additional objectives.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – True labels of length [n_samples].
objectives (list) – Non-empty list of objectives to score on.
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None, calculating_residuals=False)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series) – Targets corresponding to the pipeline targets.
X_train (pd.DataFrame) – Training data used to generate features from past observations.
y_train (pd.Series) – Training targets used to generate features from past observations.
calculating_residuals (bool) – Whether we’re calling predict_in_sample to calculate the residuals. This means the X and y arguments are not future data, but actually the train data.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.OneHotEncoder(top_n=10, features_to_encode=None, categories=None, drop='if_binary', handle_unknown='ignore', handle_missing='error', random_seed=0, **kwargs)[source]#

A transformer that encodes categorical features in a one-hot numeric array.

Parameters

top_n (int) – Number of categories per column to encode. If None, all categories will be encoded. Otherwise, the n most frequent will be encoded and all others will be dropped. Defaults to 10.
features_to_encode (list[str]) – List of columns to encode. All other columns will remain untouched. If None, all appropriate columns will be encoded. Defaults to None.
categories (list) – A two dimensional list of categories, where categories[i] is a list of the categories for the column at index i. This can also be None, or “auto” if top_n is not None. Defaults to None.
drop (string, list) – Method (“first” or “if_binary”) to use to drop one category per feature. Can also be a list specifying which categories to drop for each feature. Defaults to ‘if_binary’.
handle_unknown (string) – Whether to ignore or error for unknown categories for a feature encountered during fit or transform. If either top_n or categories is used to limit the number of categories per column, this must be “ignore”. Defaults to “ignore”.
handle_missing (string) – Options for how to handle missing (NaN) values encountered during fit or transform. If this is set to “as_category” and NaN values are within the n most frequent, “nan” values will be encoded as their own column. If this is set to “error”, any missing values encountered will raise an error. Defaults to “error”.
random_seed (int) – Seed for the random number generator. Defaults to 0.
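
The effect of top_n can be sketched in plain Python. The helper `one_hot_top_n` is hypothetical and much simpler than the component: it ignores drop, handle_unknown, and handle_missing, and the `column_category` output naming is an assumption made for readability. It only shows that the n most frequent categories get their own column and all others are dropped.

```python
from collections import Counter


def one_hot_top_n(column_name, values, top_n=10):
    """One-hot encode only the top_n most frequent categories of a column."""
    counts = Counter(values)
    # most_common(None) returns every category, mirroring top_n=None.
    kept = sorted(category for category, _ in counts.most_common(top_n))
    return {
        f"{column_name}_{category}": [int(v == category) for v in values]
        for category in kept
    }


encoded = one_hot_top_n(
    "color", ["red", "blue", "red", "green", "red", "blue"], top_n=2
)
```

Rows whose category falls outside the top n (here "green") end up with zeros in every encoded column.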

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	False
name	One Hot Encoder
training_only	False

Methods

`categories`	Returns a list of the unique categories to be encoded for the particular feature, in order.
`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits the one-hot encoder component.
`fit_transform`	Fits on X and transforms X.
`get_feature_names`	Return feature names for the categorical features after fitting.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	One-hot encode the input data.
`update_parameters`	Updates the parameter dictionary of the component.

categories(self, feature_name)[source]#

Returns a list of the unique categories to be encoded for the particular feature, in order.

Parameters: feature_name (str) – The name of any feature provided to the one-hot encoder during fit.
Returns: The unique categories, in the same dtype as they were provided during fit.
Return type: np.ndarray
Raises: ValueError – If the feature was not provided to the one-hot encoder as a training feature.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict
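
The convention stated above can be illustrated with a toy class. ToyComponent is hypothetical and only mimics the relationship between default_parameters and parameters; it does not inherit from any EvalML base class.

```python
class ToyComponent:
    """Hypothetical stand-in for a component; not an EvalML class."""

    def __init__(self, top_n=10, drop="if_binary"):
        # Store the arguments the component was initialized with.
        self.parameters = {"top_n": top_n, "drop": drop}

    @classmethod
    def default_parameters(cls):
        # The defaults are whatever a no-argument instance reports.
        return cls().parameters


defaults = ToyComponent.default_parameters()
print(defaults == ToyComponent().parameters)  # True
```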

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component. Defaults to False.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}. Defaults to False.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the one-hot encoder component.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

Raises

ValueError – If encoding a column failed.

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

get_feature_names(self)[source]#

Return feature names for the categorical features after fitting.

Feature names are formatted as {column name}_{category name}. In the event of a duplicate name, an integer will be added at the end of the feature name to distinguish it.

For example, consider a dataframe with a column called “A” and category “x_y” and another column called “A_x” with “y”. In this example, the feature names would be “A_x_y” and “A_x_y_1”.

Returns: The feature names after encoding, provided in the same order as input_features.
Return type: np.ndarray
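
The naming rule can be sketched as follows. The helper make_feature_names is hypothetical and works on plain lists; it reproduces the example above but is not the library's code.

```python
def make_feature_names(columns):
    """columns: list of (column_name, [categories]) pairs in input order."""
    names = []
    seen = set()
    for col, cats in columns:
        for cat in cats:
            base = f"{col}_{cat}"
            name = base
            suffix = 1
            # On a collision, append an increasing integer until the name is unique.
            while name in seen:
                name = f"{base}_{suffix}"
                suffix += 1
            seen.add(name)
            names.append(name)
    return names


names = make_feature_names([("A", ["x_y"]), ("A_x", ["y"])])
print(names)  # ['A_x_y', 'A_x_y_1']
```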

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

One-hot encode the input data.

Parameters

X (pd.DataFrame) – Features to one-hot encode.
y (pd.Series) – Ignored.

Returns

Transformed data, where each categorical feature has been encoded into numerical columns using one-hot encoding.

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
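
A minimal sketch of the reset_fit behavior, using a hypothetical ToyComponent rather than a real EvalML component. Only the _is_fitted flag named above is modeled.

```python
class ToyComponent:
    """Hypothetical stand-in for a component; not an EvalML class."""

    def __init__(self, **parameters):
        self.parameters = dict(parameters)
        self._is_fitted = False

    def fit(self):
        self._is_fitted = True
        return self

    def update_parameters(self, update_dict, reset_fit=True):
        # Merge the new values into the existing parameter dictionary.
        self.parameters.update(update_dict)
        if reset_fit:
            # Changed parameters invalidate the previous fit.
            self._is_fitted = False


component = ToyComponent(top_n=10).fit()
component.update_parameters({"top_n": 5})
print(component.parameters, component._is_fitted)  # {'top_n': 5} False
```

Passing reset_fit=False keeps the fitted flag unchanged, which is only appropriate when the updated parameter does not affect what was learned during fit.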

class evalml.pipelines.OrdinalEncoder(features_to_encode=None, categories=None, handle_unknown='error', unknown_value=None, encoded_missing_value=None, random_seed=0, **kwargs)[source]#

A transformer that encodes ordinal features as an array of ordinal integers representing the relative order of categories.

Parameters

features_to_encode (list[str]) – List of columns to encode. All other columns will remain untouched. If None, all appropriate columns will be encoded. Defaults to None. The order of columns does not matter.
categories (dict[str, list[str]]) – A dictionary mapping column names to their categories in the dataframes passed in at fit and transform. The order of categories specified for a column does not matter. Any category found in the data that is not present in categories will be handled as an unknown value. To not have unknown values raise an error, set handle_unknown to “use_encoded_value”. Defaults to None.
handle_unknown ("error" or "use_encoded_value") – How to handle unknown categories for a feature encountered during fit or transform. When set to “error”, an error will be raised when an unknown category is found. When set to “use_encoded_value”, unknown categories will be encoded as the value given for the parameter unknown_value. Defaults to “error”.
unknown_value (int or np.nan) – The value to use for unknown categories seen during fit or transform. Required when the parameter handle_unknown is set to “use_encoded_value”. The value has to be distinct from the values used to encode any of the categories in fit. Defaults to None.
encoded_missing_value (int or np.nan) – The value to use for missing (null) values seen during fit or transform. Defaults to np.nan.
random_seed (int) – Seed for the random number generator. Defaults to 0.
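
How handle_unknown, unknown_value, and encoded_missing_value interact can be sketched in plain Python. The function ordinal_encode and its order argument are hypothetical: here the category order is passed in explicitly, and None stands in for a missing value. This is not the component's implementation.

```python
import math


def ordinal_encode(values, order, handle_unknown="error",
                   unknown_value=None, encoded_missing_value=math.nan):
    """Encode one column given an explicit category order (hypothetical helper)."""
    codes = {cat: i for i, cat in enumerate(order)}
    if handle_unknown == "use_encoded_value":
        if unknown_value is None:
            raise ValueError("unknown_value is required with 'use_encoded_value'")
        if unknown_value in codes.values():
            raise ValueError("unknown_value must differ from the category codes")
    encoded = []
    for v in values:
        if v is None:
            # Missing values get their own dedicated encoding.
            encoded.append(encoded_missing_value)
        elif v in codes:
            encoded.append(codes[v])
        elif handle_unknown == "use_encoded_value":
            encoded.append(unknown_value)
        else:
            raise ValueError(f"Unknown category: {v!r}")
    return encoded


encoded = ordinal_encode(
    ["low", "high", "huge", None],
    order=["low", "medium", "high"],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
print(encoded[:3])  # [0, 2, -1]
```

The last element of encoded is NaN because the missing value falls back to the default encoded_missing_value.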

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	False
name	Ordinal Encoder
training_only	False

Methods

`categories`	Returns a list of the unique categories to be encoded for the particular feature, in order.
`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits the ordinal encoder component.
`fit_transform`	Fits on X and transforms X.
`get_feature_names`	Return feature names for the ordinal features after fitting.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Ordinally encode the input data.
`update_parameters`	Updates the parameter dictionary of the component.

categories(self, feature_name)[source]#

Returns a list of the unique categories to be encoded for the particular feature, in order.

Parameters: feature_name (str) – The name of any feature provided to the ordinal encoder during fit.
Returns: The unique categories, in the same dtype as they were provided during fit.
Return type: np.ndarray
Raises: ValueError – If the feature was not provided to the ordinal encoder as a training feature.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component. Defaults to False.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}. Defaults to False.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the ordinal encoder component.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

Raises

ValueError – If encoding a column failed.
TypeError – If non-Ordinal columns are specified in features_to_encode.

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

get_feature_names(self)[source]#

Return feature names for the ordinal features after fitting.

Feature names are formatted as {column name}_ordinal_encoding.

Returns: The feature names after encoding, provided in the same order as input_features.
Return type: np.ndarray

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Ordinally encode the input data.

Parameters

X (pd.DataFrame) – Features to encode.
y (pd.Series) – Ignored.

Returns

Transformed data, where each ordinal feature has been encoded into a numerical column where ordinal integers represent the relative order of categories.

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.PerColumnImputer(impute_strategies=None, random_seed=0, **kwargs)[source]#

Imputes missing data according to a specified imputation strategy per column.

Parameters

impute_strategies (dict) – Column and {“impute_strategy”: strategy, “fill_value”:value} pairings. Valid values for impute strategy include “mean”, “median”, “most_frequent”, “constant” for numerical data, and “most_frequent”, “constant” for object data types. Defaults to None, which uses “most_frequent” for all columns. When impute_strategy == “constant”, fill_value is used to replace missing data. When None, uses 0 when imputing numerical data and “missing_value” for strings or object data types.
random_seed (int) – Seed for the random number generator. Defaults to 0.
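
The shape of impute_strategies and the per-column fallbacks can be sketched with the standard library. The helper fit_fill_values is hypothetical, columns are plain lists, and None stands in for a missing value; the real component operates on DataFrames.

```python
from collections import Counter
from statistics import mean, median


def fit_fill_values(columns, impute_strategies=None):
    """Compute one fill value per column (hypothetical helper)."""
    impute_strategies = impute_strategies or {}
    fills = {}
    for name, values in columns.items():
        # Columns without an entry fall back to "most_frequent".
        spec = impute_strategies.get(name, {"impute_strategy": "most_frequent"})
        strategy = spec["impute_strategy"]
        present = [v for v in values if v is not None]
        if strategy == "mean":
            fills[name] = mean(present)
        elif strategy == "median":
            fills[name] = median(present)
        elif strategy == "most_frequent":
            fills[name] = Counter(present).most_common(1)[0][0]
        elif strategy == "constant":
            fill = spec.get("fill_value")
            if fill is None:
                # Default constants: "missing_value" for strings, 0 for numbers.
                is_text = any(isinstance(v, str) for v in present)
                fill = "missing_value" if is_text else 0
            fills[name] = fill
        else:
            raise ValueError(f"Unsupported strategy: {strategy}")
    return fills


columns = {
    "age": [20, None, 40],
    "city": ["NYC", "LA", None, "NYC"],
    "score": [1.0, None, 3.0],
}
strategies = {
    "age": {"impute_strategy": "mean"},
    "score": {"impute_strategy": "constant", "fill_value": -1.0},
}
fills = fit_fill_values(columns, strategies)
imputed = {
    name: [fills[name] if v is None else v for v in values]
    for name, values in columns.items()
}
print(imputed["age"], imputed["city"], imputed["score"])
```

Here "city" has no entry in the dictionary, so it is filled with its most frequent value.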

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	False
name	Per Column Imputer
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits imputers on input data.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms input data by imputing missing values.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component. Defaults to False.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}. Defaults to False.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits imputers on input data.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features] to fit.
y (pd.Series, optional) – The target training data of length [n_samples]. Ignored.

Returns

self

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms input data by imputing missing values.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features] to transform.
y (pd.Series, optional) – The target training data of length [n_samples]. Ignored.

Returns

Transformed X

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.PipelineBase(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Machine learning pipeline.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”].
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
custom_name (str) – Custom name for the pipeline. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.
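
The duplicate-naming rule and the layout of the parameters dictionary can be sketched as follows. The helper name_components is hypothetical and assumes the suffix is the zero-based position in the list, which matches the “Imputer_2” example above; the parameter values are illustrative only.

```python
def name_components(component_names):
    """Give repeated component names a suffix based on their list index."""
    seen = set()
    unique_names = []
    for index, name in enumerate(component_names):
        # The first occurrence keeps its name; later ones get "_<index>".
        unique_names.append(f"{name}_{index}" if name in seen else name)
        seen.add(name)
    return unique_names


graph = ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
names = name_components(graph)
print(names)
# ['Imputer', 'One Hot Encoder', 'Imputer_2', 'Logistic Regression Classifier']

# Parameters are keyed by the (possibly suffixed) component name.
parameters = {
    "One Hot Encoder": {"top_n": 5},
}
```

Components that are absent from parameters, such as both imputers here, use their default values.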

Attributes

problem_type	None

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Build a model.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Make predictions using selected features.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on current and additional objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)[source]#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

clone(self)[source]#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)[source]#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

describe(self, return_dict=False)[source]#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

abstract fit(self, X, y)[source]#

Build a model.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].
y (pd.Series, np.ndarray) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y)[source]#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)[source]#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: Component to return
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)[source]#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)[source]#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)[source]#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dict

graph_feature_importance(self, importance_threshold=0)[source]#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)[source]#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series
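
The reverse-order traversal can be illustrated with toy steps. ShiftStep, LogStep, PassThroughStep, and inverse_transform_all are hypothetical names; the sketch only shows that steps without an inverse_transform are skipped and that the remaining inverses run last-to-first.

```python
import math


class ShiftStep:
    def transform(self, y):
        return [v + 1.0 for v in y]

    def inverse_transform(self, y):
        return [v - 1.0 for v in y]


class LogStep:
    def transform(self, y):
        return [math.log(v) for v in y]

    def inverse_transform(self, y):
        return [math.exp(v) for v in y]


class PassThroughStep:
    """A step with no inverse_transform; it is skipped on the way back."""

    def transform(self, y):
        return y


def inverse_transform_all(steps, y):
    # Walk the steps from last to first, undoing each one that can be undone.
    for step in reversed(steps):
        inverse = getattr(step, "inverse_transform", None)
        if inverse is not None:
            y = inverse(y)
    return y


steps = [ShiftStep(), PassThroughStep(), LogStep()]
target = [0.0, 1.0, 3.0]
transformed = target
for step in steps:
    transformed = step.transform(transformed)
recovered = inverse_transform_all(steps, transformed)
```

Applying the inverses in forward order instead would subtract 1 before exponentiating and would not recover the original target.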

static load(file_path: Union[str, io.BytesIO])[source]#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – load filepath or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)[source]#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)[source]#

Make predictions using selected features.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored. Only used for time series.
y_train (pd.Series or None) – Training labels. Ignored. Only used for time series.

Returns

Predicted values.

Return type

pd.Series

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)[source]#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

abstract score(self, X, y, objectives, X_train=None, y_train=None)[source]#

Evaluate model performance on current and additional objectives.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series, np.ndarray) – True labels of length [n_samples].
objectives (list) – Non-empty list of objectives to score on.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored. Only used for time series.
y_train (pd.Series or None) – Training labels. Ignored. Only used for time series.

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.
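
The format of the example string can be reproduced with a short sketch. The helper summarize is hypothetical and assumes the last component in the list is the estimator.

```python
def summarize(component_names):
    """Build a summary string from component names (hypothetical helper)."""
    # Assumption: the final component is the estimator, the rest are transformers.
    *transformers, estimator = component_names
    if not transformers:
        return estimator
    return f"{estimator} w/ " + " + ".join(transformers)


summary = summarize(
    ["Simple Imputer", "One Hot Encoder", "Logistic Regression Classifier"]
)
print(summary)  # Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder
```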

transform(self, X, y=None)[source]#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None)[source]#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series or None) – Targets corresponding to X. Optional.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Only used for time series.
y_train (pd.Series or None) – Training labels. Only used for time series.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.ProphetRegressor(time_index: Optional[Hashable] = None, changepoint_prior_scale: float = 0.05, seasonality_prior_scale: int = 10, holidays_prior_scale: int = 10, seasonality_mode: str = 'additive', stan_backend: str = 'CMDSTANPY', interval_width: float = 0.95, random_seed: Union[int, float] = 0, **kwargs)[source]#

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

More information here: https://facebook.github.io/prophet/

Parameters

time_index (str) – Specifies the name of the column in X that provides the datetime objects. Defaults to None.
changepoint_prior_scale (float) – Determines the strength of the sparse prior for fitting on rate changes. Increasing this value increases the flexibility of the trend. Defaults to 0.05.
seasonality_prior_scale (int) – Similar to changepoint_prior_scale. Adjusts the extent to which the seasonality model will fit the data. Defaults to 10.
holidays_prior_scale (int) – Similar to changepoint_prior_scale. Adjusts the extent to which holidays will fit the data. Defaults to 10.
seasonality_mode (str) – Determines how this component fits the seasonality. Options are “additive” and “multiplicative”. Defaults to “additive”.
stan_backend (str) – Determines the backend that should be used to run Prophet. Options are “CMDSTANPY” and “PYSTAN”. Defaults to “CMDSTANPY”.
interval_width (float) – Determines the confidence of the prediction interval range when calling get_prediction_intervals. Accepts values in the range (0,1). Defaults to 0.95.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “changepoint_prior_scale”: Real(0.001, 0.5), “seasonality_prior_scale”: Real(0.01, 10), “holidays_prior_scale”: Real(0.01, 10), “seasonality_mode”: [“additive”, “multiplicative”],}
model_family	ModelFamily.PROPHET
modifies_features	True
modifies_target	False
name	Prophet Regressor
supported_problem_types	[ProblemTypes.TIME_SERIES_REGRESSION]
training_only	False

Methods

`build_prophet_df`	Build the Prophet DataFrame to pass to fit and predict.
`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns an array of zeros of length 1, because feature importance is not defined for the Prophet regressor.
`fit`	Fits Prophet regressor component to data.
`get_params`	Get parameters for the Prophet regressor.
`get_prediction_intervals`	Find the prediction intervals using the fitted ProphetRegressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using fitted Prophet regressor.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

static build_prophet_df(X: pandas.DataFrame, y: Optional[pandas.Series] = None, time_index: str = 'ds') → pandas.DataFrame[source]#: Build the Prophet DataFrame to pass to fit and predict.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls) → dict#

Returns the default parameters for this component.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component. Defaults to False.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}. Defaults to False.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → numpy.ndarray#: Returns an array of zeros of length 1, because feature importance is not defined for the Prophet regressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)[source]#

Fits Prophet regressor component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

get_params(self) → dict[source]#: Get parameters for the Prophet regressor.

get_prediction_intervals(self, X, y=None, coverage=None, predictions=None)[source]#

Find the prediction intervals using the fitted ProphetRegressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (List[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Not used for Prophet estimator.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict
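
The key layout of the returned dictionary can be sketched as below. The function normal_intervals is hypothetical and derives bounds from a fixed residual standard deviation under a normality assumption; that is not how Prophet computes its intervals and is used here only to produce plausible values for each key.

```python
from statistics import NormalDist


def normal_intervals(predictions, residual_std, coverage=(0.95,)):
    """Build {coverage}_lower / {coverage}_upper entries (hypothetical helper)."""
    intervals = {}
    for c in coverage:
        # Two-sided interval: leave (1 - c) / 2 probability in each tail.
        z = NormalDist().inv_cdf(0.5 + c / 2)
        intervals[f"{c}_lower"] = [p - z * residual_std for p in predictions]
        intervals[f"{c}_upper"] = [p + z * residual_std for p in predictions]
    return intervals


intervals = normal_intervals([10.0, 12.0], residual_std=2.0, coverage=[0.8, 0.95])
print(sorted(intervals))
# ['0.8_lower', '0.8_upper', '0.95_lower', '0.95_upper']
```

Each coverage value contributes one lower and one upper entry, and a higher coverage yields a wider interval.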

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None) → pandas.Series[source]#

Make predictions using fitted Prophet regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.

Returns

Predicted values.

Return type

pd.Series

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.RandomForestClassifier(n_estimators=100, max_depth=6, n_jobs=-1, random_seed=0, **kwargs)[source]#

Random Forest Classifier.

Parameters

n_estimators (int) – The number of trees in the forest. Defaults to 100.
max_depth (int) – Maximum tree depth for base learners. Defaults to 6.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “n_estimators”: Integer(10, 1000), “max_depth”: Integer(1, 10),}
model_family	ModelFamily.RANDOM_FOREST
modifies_features	True
modifies_target	False
name	Random Forest Classifier
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict
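
The convention above can be sketched with a toy class. `ToyComponent` is hypothetical and is not an EvalML class; it uses a classmethod rather than a class-level property, and reading the defaults from the `__init__` signature is an assumption made for illustration.

```python
import inspect


class ToyComponent:
    """Hypothetical component used only to illustrate the convention."""

    def __init__(self, n_estimators=100, max_depth=6):
        # Parameters used to initialize the component.
        self.parameters = {"n_estimators": n_estimators, "max_depth": max_depth}

    @classmethod
    def default_parameters(cls):
        # Collect every __init__ argument that declares a default value.
        signature = inspect.signature(cls.__init__)
        return {
            name: param.default
            for name, param in signature.parameters.items()
            if param.default is not inspect.Parameter.empty
        }


# Defaults reported by the class match a default-constructed instance.
assert ToyComponent.default_parameters() == ToyComponent().parameters
```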

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.
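
A save/load round trip might look like the following sketch. It uses the standard-library `pickle` module as a stand-in for `cloudpickle`, and `save_component` and `load_component` are hypothetical helper names, not EvalML functions.

```python
import os
import pickle
import tempfile


def save_component(component_state, file_path, pickle_protocol=pickle.DEFAULT_PROTOCOL):
    """Write the object to file_path using the given pickle protocol."""
    with open(file_path, "wb") as handle:
        pickle.dump(component_state, handle, protocol=pickle_protocol)


def load_component(file_path):
    """Read back whatever object was stored at file_path."""
    with open(file_path, "rb") as handle:
        return pickle.load(handle)


# Round trip through a temporary directory so nothing is left on disk.
state = {
    "name": "Random Forest Classifier",
    "parameters": {"n_estimators": 100, "max_depth": 6},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "component.pkl")
    save_component(state, path)
    restored = load_component(path)
```

Unpickling can execute arbitrary code, so only load files from sources you trust.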

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
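
The effect of `reset_fit` can be sketched with a toy class. `ToyComponent` is hypothetical; the only behavior taken from the description above is that the parameter dictionary is updated and that `_is_fitted` is set to False when `reset_fit` is True.

```python
class ToyComponent:
    """Hypothetical component; not an EvalML class."""

    def __init__(self, **parameters):
        self.parameters = dict(parameters)
        self._is_fitted = False

    def fit(self):
        # A real component would train a model here.
        self._is_fitted = True
        return self

    def update_parameters(self, update_dict, reset_fit=True):
        # Merge the new values into the existing parameter dictionary.
        self.parameters.update(update_dict)
        if reset_fit:
            # Changed parameters invalidate the previous fit.
            self._is_fitted = False


component = ToyComponent(max_depth=6).fit()
component.update_parameters({"max_depth": 8})

kept = ToyComponent(max_depth=6).fit()
kept.update_parameters({"max_depth": 3}, reset_fit=False)
```

After these calls, `component` must be refit before predicting, while `kept` still reports itself as fitted even though its parameters no longer match the trained model.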

class evalml.pipelines.RandomForestRegressor(n_estimators: int = 100, max_depth: int = 6, n_jobs: int = -1, random_seed: Union[int, float] = 0, **kwargs)[source]#

Random Forest Regressor.

Parameters

n_estimators (int) – The number of trees in the forest. Defaults to 100.
max_depth (int) – Maximum tree depth for base learners. Defaults to 6.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “n_estimators”: Integer(10, 1000), “max_depth”: Integer(1, 32),}
model_family	ModelFamily.RANDOM_FOREST
modifies_features	True
modifies_target	False
name	Random Forest Regressor
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION, ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns importance associated with each feature.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted RandomForestRegressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#

Returns importance associated with each feature.

Returns: Importance associated with each feature.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a feature_importance method or a component_obj that implements feature_importance.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

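get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted RandomForestRegressor.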

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Optional.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict
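
One plausible way to derive such intervals from a forest is a normal approximation over the per-tree predictions. The sketch below assumes that approach; this page does not confirm it is how EvalML computes them, and `intervals_from_trees` is a hypothetical helper.

```python
from statistics import NormalDist, mean, stdev
from typing import Dict, List


def intervals_from_trees(
    tree_predictions: List[List[float]], coverage: List[float]
) -> Dict[str, List[float]]:
    """Normal-approximation intervals; tree_predictions[t][i] is tree t's value for sample i."""
    # Regroup so that each entry holds all tree predictions for one sample.
    per_sample = list(zip(*tree_predictions))
    centers = [mean(values) for values in per_sample]
    spreads = [stdev(values) for values in per_sample]

    result = {}
    for level in coverage:
        # Two-sided z-score for the requested coverage level.
        z = NormalDist().inv_cdf((1 + level) / 2)
        result[f"{level}_lower"] = [c - z * s for c, s in zip(centers, spreads)]
        result[f"{level}_upper"] = [c + z * s for c, s in zip(centers, spreads)]
    return result


# Three trees, two samples. The trees agree exactly on the second sample.
trees = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0]]
intervals = intervals_from_trees(trees, coverage=[0.95])
```

Where the trees disagree the interval widens, and where they agree exactly it collapses to the point prediction.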

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.RegressionPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline subclass for all regression pipelines.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”].
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
custom_name (str) – Custom name for the pipeline. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.
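
The duplicate-naming rule described for component_graph can be sketched as follows. `unique_component_names` is a hypothetical helper, not EvalML's implementation; it assumes the suffix is the zero-based position in the list, which is consistent with the “Imputer_2” example above.

```python
from typing import List


def unique_component_names(names: List[str]) -> List[str]:
    """Append the list index to any component name that has already appeared."""
    seen = set()
    result = []
    for index, name in enumerate(names):
        if name in seen:
            # Repeated name: disambiguate with its position in the list.
            result.append(f"{name}_{index}")
        else:
            seen.add(name)
            result.append(name)
    return result


print(unique_component_names(
    ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
))
```

The resulting names are the keys to use in the parameters dictionary, so parameters for the second imputer would go under “Imputer_2”.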

Example

>>> pipeline = RegressionPipeline(component_graph=["Simple Imputer", "Linear Regressor"],
...                               parameters={"Simple Imputer": {"impute_strategy": "mean"}},
...                               custom_name="My Regression Pipeline")
...
>>> assert pipeline.custom_name == "My Regression Pipeline"
>>> assert pipeline.component_graph.component_dict.keys() == {'Simple Imputer', 'Linear Regressor'}

The pipeline parameters will be chosen from the default parameters for every component, unless specific parameters were passed in as they were above.

>>> assert pipeline.parameters == {
...     'Simple Imputer': {'impute_strategy': 'mean', 'fill_value': None},
...     'Linear Regressor': {'fit_intercept': True, 'n_jobs': -1}}

Attributes

problem_type

ProblemTypes.REGRESSION

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Build a regression model.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Make predictions using selected features.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on current and additional objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)[source]#

Build a regression model.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training data of length [n_samples]

Returns

self

Raises

ValueError – If the target is not numeric.

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: The component with the given name.
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series
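
The reverse-order idea can be sketched with two toy target transforms. `ShiftStep`, `LogStep`, and `inverse_transform_all` are hypothetical names used for illustration and are not EvalML components.

```python
import math


class ShiftStep:
    """Hypothetical target transform that adds a constant offset."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, y):
        return [value + self.offset for value in y]

    def inverse_transform(self, y):
        return [value - self.offset for value in y]


class LogStep:
    """Hypothetical target transform that takes the natural logarithm."""

    def transform(self, y):
        return [math.log(value) for value in y]

    def inverse_transform(self, y):
        return [math.exp(value) for value in y]


def inverse_transform_all(steps, y):
    """Undo the steps last to first, skipping any step without inverse_transform."""
    for step in reversed(steps):
        if hasattr(step, "inverse_transform"):
            y = step.inverse_transform(y)
    return y


steps = [ShiftStep(1.0), LogStep()]
target = [0.0, 1.0, 9.0]

# Forward pass: shift first, then take the log.
transformed = target
for step in steps:
    transformed = step.transform(transformed)

# Backward pass: exponentiate first, then remove the shift.
recovered = inverse_transform_all(steps, transformed)
```

Applying the inverses in the forward order instead would subtract the offset before exponentiating and would not recover the original target.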

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – File path to load from, or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)[source]#

Make predictions using selected features.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored by this pipeline; only used for time series.
y_train (pd.Series or None) – Training labels. Ignored by this pipeline; only used for time series.

Returns

Predicted values.

Return type

pd.Series

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)[source]#

Evaluate model performance on current and additional objectives.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features]
y (pd.Series, or np.ndarray) – True values of length [n_samples]
objectives (list) – Non-empty list of objectives to score on
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored by this pipeline; only used for time series.
y_train (pd.Series or None) – Training labels. Ignored by this pipeline; only used for time series.

Returns

Ordered dictionary of objective scores.

Return type

dict
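
The shape of the result can be sketched in plain Python. The `mae`, `mse`, and `score_predictions` functions below are hypothetical stand-ins, not EvalML objectives or methods; the sketch only shows an ordered mapping from objective name to score, in the order the objectives were given.

```python
from collections import OrderedDict


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def score_predictions(y_true, y_pred, objectives):
    """Score predictions on each (name, function) pair, preserving the given order."""
    if not objectives:
        raise ValueError("objectives must be a non-empty list")
    return OrderedDict((name, fn(y_true, y_pred)) for name, fn in objectives)


scores = score_predictions(
    y_true=[1.0, 2.0, 3.0],
    y_pred=[1.0, 2.0, 5.0],
    objectives=[("MAE", mae), ("MSE", mse)],
)
```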

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series or None) – Targets corresponding to X. Optional.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Only used for time series.
y_train (pd.Series or None) – Training labels. Only used for time series.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.RFClassifierSelectFromModel(number_features=None, n_estimators=10, max_depth=None, percent_features=0.5, threshold='median', n_jobs=-1, random_seed=0, **kwargs)[source]#

Selects top features based on importance weights using a Random Forest classifier.

Parameters

number_features (int) – The maximum number of features to select. If both percent_features and number_features are specified, take the greater number of features. Defaults to None.
n_estimators (int) – The number of trees in the forest. Defaults to 10.
max_depth (int) – Maximum tree depth for base learners. Defaults to None.
percent_features (float) – Percentage of features to use. If both percent_features and number_features are specified, take the greater number of features. Defaults to 0.5.
threshold (string or float) – The threshold value to use for feature selection. Features whose importance is greater than or equal to the threshold are kept; the others are discarded. If “median”, the threshold is the median of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. Defaults to “median”.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
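
How number_features, percent_features, and threshold might interact can be sketched in plain Python. The functions `resolve_threshold` and `select_features` are hypothetical, the importances are passed in directly rather than coming from a fitted forest, and the exact rounding and tie-breaking are assumptions rather than EvalML's behavior.

```python
from statistics import mean, median
from typing import List, Optional, Union


def resolve_threshold(threshold: Union[str, float], importances: List[float]) -> float:
    """Turn 'median', 'mean', '1.25*mean', or a number into a numeric cutoff."""
    if isinstance(threshold, (int, float)):
        return float(threshold)
    scale = 1.0
    reference = threshold.strip()
    if "*" in reference:
        # Split a scaled form such as "1.25*mean" into factor and reference.
        factor, reference = reference.split("*", 1)
        scale = float(factor)
        reference = reference.strip()
    if reference == "median":
        return scale * median(importances)
    if reference == "mean":
        return scale * mean(importances)
    raise ValueError(f"Unknown threshold: {threshold!r}")


def select_features(
    names: List[str],
    importances: List[float],
    number_features: Optional[int] = None,
    percent_features: float = 0.5,
    threshold: Union[str, float] = "median",
) -> List[str]:
    """Keep features at or above the cutoff, capped at the larger of the two limits."""
    # When both limits are given, the greater number of features wins.
    limit = max(number_features or 0, int(percent_features * len(names)))
    cutoff = resolve_threshold(threshold, importances)
    # Rank by importance so the cap keeps the most important features.
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    kept = [name for name, value in ranked if value >= cutoff]
    return kept[:limit]


names = ["a", "b", "c", "d"]
importances = [0.4, 0.3, 0.2, 0.1]
print(select_features(names, importances))
print(select_features(names, importances, threshold="1.25*mean"))
```

Note that this sketch returns features in order of importance, not in their original column order.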

Attributes

hyperparameter_ranges	{ “percent_features”: Real(0.01, 1), “threshold”: [“mean”, “median”],}
modifies_features	True
modifies_target	False
name	RF Classifier Select From Model
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits component to data.
`fit_transform`	Fit and transform data using the feature selector.
`get_names`	Get names of selected features.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms input data by selecting features. If the component_obj does not have a transform method, will raise a MethodPropertyNotFoundError exception.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)#

Fits component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features]
y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X, y=None)#

Fit and transform data using the feature selector.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Transformed data.

Return type

pd.DataFrame

get_names(self)#

Get names of selected features.

Returns: List of the names of features selected.
Return type: list[str]

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)#

Transforms input data by selecting features. If the component_obj does not have a transform method, will raise a MethodPropertyNotFoundError exception.

Parameters

X (pd.DataFrame) – Data to transform.
y (pd.Series, optional) – Target data. Ignored.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If feature selector does not have a transform method or a component_obj that implements transform

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.RFRegressorSelectFromModel(number_features=None, n_estimators=10, max_depth=None, percent_features=0.5, threshold='median', n_jobs=-1, random_seed=0, **kwargs)[source]#

Selects top features based on importance weights using a Random Forest regressor.

Parameters

number_features (int) – The maximum number of features to select. If both percent_features and number_features are specified, take the greater number of features. Defaults to None.
n_estimators (int) – The number of trees in the forest. Defaults to 10.
max_depth (int) – Maximum tree depth for base learners. Defaults to None.
percent_features (float) – Percentage of features to use. If both percent_features and number_features are specified, take the greater number of features. Defaults to 0.5.
threshold (string or float) – The threshold value to use for feature selection. Features whose importance is greater than or equal to the threshold are kept; the others are discarded. If “median”, the threshold is the median of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. Defaults to “median”.
n_jobs (int or None) – Number of jobs to run in parallel. -1 uses all processes. Defaults to -1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges	{ “percent_features”: Real(0.01, 1), “threshold”: [“mean”, “median”],}
modifies_features	True
modifies_target	False
name	RF Regressor Select From Model
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits component to data.
`fit_transform`	Fit and transform data using the feature selector.
`get_names`	Get names of selected features.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms input data by selecting features. If the component_obj does not have a transform method, will raise a MethodPropertyNotFoundError exception.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – Whether to print the name of the component.
return_dict (bool, optional) – Whether to return the description as a dictionary in the format {“name”: name, “parameters”: parameters}.

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)#

Fits component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features]
y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X, y=None)#

Fit and transform data using the feature selector.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Transformed data.

Return type

pd.DataFrame

get_names(self)#

Get names of selected features.

Returns: List of the names of features selected.
Return type: list[str]

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)#

Transforms input data by selecting features. Raises a MethodPropertyNotFoundError if the component_obj does not have a transform method.

Parameters

X (pd.DataFrame) – Data to transform.
y (pd.Series, optional) – Target data. Ignored.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If feature selector does not have a transform method or a component_obj that implements transform

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
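
The effect of reset_fit can be sketched with a toy class. This is an illustration under stated assumptions, not evalml's implementation: `ToySelector` and its `threshold` parameter are hypothetical, and only the `_is_fitted` flag named above is modelled.

```python
class ToySelector:
    """Hypothetical stand-in for a component; not part of evalml."""

    def __init__(self, **parameters):
        self.parameters = dict(parameters)
        self._is_fitted = False

    def fit(self, X, y=None):
        # A real component would learn from X here; the toy only records the state.
        self._is_fitted = True
        return self

    def update_parameters(self, update_dict, reset_fit=True):
        # Merge the new values into the existing parameter dictionary.
        self.parameters.update(update_dict)
        if reset_fit:
            # The parameters changed, so an earlier fit may no longer be valid.
            self._is_fitted = False


component = ToySelector(threshold=0.5).fit(X=[[1.0], [2.0]])
component.update_parameters({"threshold": 0.8})

kept = ToySelector(threshold=0.5).fit(X=[[1.0], [2.0]])
kept.update_parameters({"threshold": 0.8}, reset_fit=False)
```

With the default reset_fit=True the component must be fitted again before use; with reset_fit=False the fitted flag is left untouched.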

class evalml.pipelines.SimpleImputer(impute_strategy='most_frequent', fill_value=None, random_seed=0, **kwargs)[source]#

Imputes missing data according to a specified imputation strategy. Natural language columns are ignored.

Parameters

impute_strategy (string) – Impute strategy to use. Valid values include “mean”, “median”, “most_frequent”, “constant” for numerical data, and “most_frequent”, “constant” for object data types.
fill_value (string) – When impute_strategy == “constant”, fill_value is used to replace missing data. Defaults to 0 when imputing numerical data and “missing_value” for strings or object data types.
random_seed (int) – Seed for the random number generator. Defaults to 0.
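
As a rough illustration of what the four strategies do to a single column, the sketch below uses only the Python standard library. It does not call evalml, and `impute_column` is a hypothetical helper; the real component works on whole DataFrames and handles data types that this toy ignores.

```python
from collections import Counter
from statistics import mean, median


def impute_column(values, impute_strategy="most_frequent", fill_value=None):
    """Fill None entries in one column; a toy illustration, not evalml's code."""
    observed = [v for v in values if v is not None]
    if impute_strategy == "mean":
        replacement = mean(observed)
    elif impute_strategy == "median":
        replacement = median(observed)
    elif impute_strategy == "most_frequent":
        replacement = Counter(observed).most_common(1)[0][0]
    elif impute_strategy == "constant":
        replacement = fill_value
    else:
        raise ValueError(f"Unknown impute_strategy: {impute_strategy}")
    return [replacement if v is None else v for v in values]


ages = [20, None, 30, 30, None]
by_mean = impute_column(ages, "mean")
by_mode = impute_column(ages, "most_frequent")
by_constant = impute_column(["a", None], "constant", fill_value="missing_value")
```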

Attributes

hyperparameter_ranges	{“impute_strategy”: [“mean”, “median”, “most_frequent”]}
modifies_features	True
modifies_target	False
name	Simple Imputer
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits imputer to data. 'None' values are converted to np.nan before imputation and are treated as the same.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms input by imputing missing values. 'None' and np.nan values are treated as the same.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits imputer to data. ‘None’ values are converted to np.nan before imputation and are treated as the same.

Parameters

X (pd.DataFrame or np.ndarray) – the input training data of shape [n_samples, n_features]
y (pd.Series, optional) – the target training data of length [n_samples]

Returns

self

Raises

ValueError – if the SimpleImputer receives a dataframe with both Boolean and Categorical data.

fit_transform(self, X, y=None)[source]#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform
y (pd.Series, optional) – Target data.

Returns

Transformed X

Return type

pd.DataFrame

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms input by imputing missing values. ‘None’ and np.nan values are treated as the same.

Parameters

X (pd.DataFrame) – Data to transform.
y (pd.Series, optional) – Ignored.

Returns

Transformed X

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.StackedEnsembleBase(final_estimator=None, n_jobs=-1, random_seed=0, **kwargs)[source]#

Stacked Ensemble Base Class.

Parameters

final_estimator (Estimator or subclass) – The estimator used to combine the base estimators.
n_jobs (int or None) – Integer describing level of parallelism used for pipelines. None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Defaults to -1. Note: there could be some multi-process errors thrown for values of n_jobs != 1. If this is the case, please use n_jobs = 1.
random_seed (int) – Seed for the random number generator. Defaults to 0.
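
The n_jobs setting can be read as a worker count. The sketch below assumes the joblib convention followed by scikit-learn, in which negative values count back from the number of CPUs; `resolve_n_jobs` is a hypothetical helper written for illustration and is not an evalml function.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Map an n_jobs setting to a worker count (joblib-style convention)."""
    if n_jobs is None:
        return 1  # None and 1 are equivalent
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on; never fewer than one.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


workers = {n: resolve_n_jobs(n, n_cpus=8) for n in (None, 1, -1, -2, 4)}
```

On an 8-CPU machine, -1 maps to 8 workers and -2 to 7, while None and 1 both run a single worker.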

Attributes

model_family	ModelFamily.ENSEMBLE
modifies_features	True
modifies_target	False
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for stacked ensemble classes.
`describe`	Describe a component and its parameters.
`feature_importance`	Not implemented for StackedEnsembleClassifier and StackedEnsembleRegressor.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`name`	Returns string name of this component.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`supported_problem_types`	Problem types this estimator supports.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for stacked ensemble classes.

Returns: default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Not implemented for StackedEnsembleClassifier and StackedEnsembleRegressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
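
The shape of the returned dictionary can be sketched as follows. This is an illustration of the key format only: it assumes normally distributed residuals with a known standard deviation and a default coverage of 0.95, neither of which is confirmed here as the library's method, and `toy_prediction_intervals` is a hypothetical helper that returns lists rather than pd.Series.

```python
from statistics import NormalDist


def toy_prediction_intervals(predictions, residual_std, coverage=None):
    """Build {coverage}_lower / {coverage}_upper entries around point predictions."""
    coverage = coverage or [0.95]  # assumed default, for illustration only
    intervals = {}
    for level in coverage:
        # Two-sided normal quantile, e.g. about 1.96 for 95% coverage.
        z = NormalDist().inv_cdf(0.5 + level / 2)
        intervals[f"{level}_lower"] = [p - z * residual_std for p in predictions]
        intervals[f"{level}_upper"] = [p + z * residual_std for p in predictions]
    return intervals


intervals = toy_prediction_intervals(
    [10.0, 12.0], residual_std=2.0, coverage=[0.8, 0.95]
)
```

Each requested coverage contributes one lower and one upper entry, and a higher coverage gives a wider interval around the same predictions.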

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

property name(cls)#: Returns string name of this component.

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

property supported_problem_types(cls)#: Problem types this estimator supports.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.StackedEnsembleClassifier(final_estimator=None, n_jobs=-1, random_seed=0, **kwargs)[source]#

Stacked Ensemble Classifier.

Parameters

final_estimator (Estimator or subclass) – The classifier used to combine the base estimators. If None, uses ElasticNetClassifier.
n_jobs (int or None) – Integer describing level of parallelism used for pipelines. None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Defaults to -1. Note: there could be some multi-process errors thrown for values of n_jobs != 1. If this is the case, please use n_jobs = 1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Example

>>> from evalml.pipelines.component_graph import ComponentGraph
>>> from evalml.pipelines.components.estimators.classifiers.decision_tree_classifier import DecisionTreeClassifier
>>> from evalml.pipelines.components.estimators.classifiers.elasticnet_classifier import ElasticNetClassifier
>>> from evalml.pipelines import StackedEnsembleClassifier
...
>>> component_graph = {
...     "Decision Tree": [DecisionTreeClassifier(random_seed=3), "X", "y"],
...     "Decision Tree B": [DecisionTreeClassifier(random_seed=4), "X", "y"],
...     "Stacked Ensemble": [
...         StackedEnsembleClassifier(n_jobs=1, final_estimator=DecisionTreeClassifier()),
...         "Decision Tree.x",
...         "Decision Tree B.x",
...         "y",
...     ],
... }
...
>>> cg = ComponentGraph(component_graph)
>>> assert cg.default_parameters == {
...     'Decision Tree Classifier': {'criterion': 'gini',
...                                  'max_features': 'sqrt',
...                                  'max_depth': 6,
...                                  'min_samples_split': 2,
...                                  'min_weight_fraction_leaf': 0.0},
...     'Stacked Ensemble Classifier': {'final_estimator': ElasticNetClassifier,
...                                     'n_jobs': -1}}

Attributes

hyperparameter_ranges	{}
model_family	ModelFamily.ENSEMBLE
modifies_features	True
modifies_target	False
name	Stacked Ensemble Classifier
supported_problem_types	[ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for stacked ensemble classes.
`describe`	Describe a component and its parameters.
`feature_importance`	Not implemented for StackedEnsembleClassifier and StackedEnsembleRegressor.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for stacked ensemble classes.

Returns: default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Not implemented for StackedEnsembleClassifier and StackedEnsembleRegressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.StackedEnsembleRegressor(final_estimator=None, n_jobs=-1, random_seed=0, **kwargs)[source]#

Stacked Ensemble Regressor.

Parameters

final_estimator (Estimator or subclass) – The regressor used to combine the base estimators. If None, uses ElasticNetRegressor.
n_jobs (int or None) – Integer describing level of parallelism used for pipelines. None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Defaults to -1. Note: there could be some multi-process errors thrown for values of n_jobs != 1. If this is the case, please use n_jobs = 1.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Example

>>> from evalml.pipelines.component_graph import ComponentGraph
>>> from evalml.pipelines.components.estimators.regressors.rf_regressor import RandomForestRegressor
>>> from evalml.pipelines.components.estimators.regressors.elasticnet_regressor import ElasticNetRegressor
>>> from evalml.pipelines import StackedEnsembleRegressor
...
>>> component_graph = {
...     "Random Forest": [RandomForestRegressor(random_seed=3), "X", "y"],
...     "Random Forest B": [RandomForestRegressor(random_seed=4), "X", "y"],
...     "Stacked Ensemble": [
...         StackedEnsembleRegressor(n_jobs=1, final_estimator=RandomForestRegressor()),
...         "Random Forest.x",
...         "Random Forest B.x",
...         "y",
...     ],
... }
...
>>> cg = ComponentGraph(component_graph)
>>> assert cg.default_parameters == {
...     'Random Forest Regressor': {'n_estimators': 100,
...                                 'max_depth': 6,
...                                 'n_jobs': -1},
...     'Stacked Ensemble Regressor': {'final_estimator': ElasticNetRegressor,
...                                    'n_jobs': -1}}

Attributes

hyperparameter_ranges	{}
model_family	ModelFamily.ENSEMBLE
modifies_features	True
modifies_target	False
name	Stacked Ensemble Regressor
supported_problem_types	[ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for stacked ensemble classes.
`describe`	Describe a component and its parameters.
`feature_importance`	Not implemented for StackedEnsembleClassifier and StackedEnsembleRegressor.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for stacked ensemble classes.

Returns: default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Not implemented for StackedEnsembleClassifier and StackedEnsembleRegressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.StandardScaler(random_seed=0, **kwargs)[source]#

A transformer that standardizes input features by removing the mean and scaling to unit variance.

Parameters: random_seed (int) – Seed for the random number generator. Defaults to 0.
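
Removing the mean and scaling to unit variance amounts to computing (x - mean) / std for each column. The sketch below shows that arithmetic with the standard library only. It assumes the population standard deviation, as scikit-learn's scaler uses, and `standardize` is a hypothetical helper rather than part of evalml.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return (x - mean) / std for one numeric column, using the population std."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column has no spread; leave it centered at zero.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


scaled = standardize([2.0, 4.0, 6.0, 8.0])
```

After scaling, the column has mean 0 and population variance 1, and the ordering and relative spacing of the values are unchanged.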

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	False
name	Standard Scaler
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits the standard scaler on the given data.
`fit_transform`	Fit and transform data using the standard scaler component.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transform data using the fitted standard scaler.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the standard scaler on the given data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y=None)[source]#

Fit and transform data using the standard scaler component.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Transformed data.

Return type

pd.DataFrame

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transform data using the fitted standard scaler.

Parameters

X (pd.DataFrame) – The data to transform, of shape [n_samples, n_features].
y (pd.Series, optional) – The target data of length [n_samples].

Returns

Transformed data.

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.SVMClassifier(C=1.0, kernel='rbf', gamma='auto', probability=True, random_seed=0, **kwargs)[source]#

Support Vector Machine Classifier.

Parameters

C (float) – The regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty. Defaults to 1.0.
kernel ({"poly", "rbf", "sigmoid"}) – Specifies the kernel type to be used in the algorithm. Defaults to “rbf”.
gamma ({"scale", "auto"} or float) – Kernel coefficient for “rbf”, “poly” and “sigmoid”. Defaults to “auto”. If “scale”, uses 1 / (n_features * X.var()) as the value of gamma. If “auto”, uses 1 / n_features.
probability (boolean) – Whether to enable probability estimates. Defaults to True.
random_seed (int) – Seed for the random number generator. Defaults to 0.
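
The two named gamma settings reduce to simple arithmetic on the training data. The sketch below works that arithmetic through on a tiny matrix using the standard library only. It assumes X.var() means the population variance over every entry of X, as it would for a NumPy array, and `resolve_gamma` is a hypothetical helper, not an evalml or scikit-learn function.

```python
from statistics import pvariance


def resolve_gamma(gamma, X):
    """Turn a gamma setting into a number; X is a list of equal-length rows."""
    n_features = len(X[0])
    if gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        # Variance over every entry of X, not per column.
        flat = [value for row in X for value in row]
        return 1.0 / (n_features * pvariance(flat))
    return float(gamma)  # an explicit numeric value is used as-is


X = [[0.0, 2.0], [4.0, 6.0]]
gamma_auto = resolve_gamma("auto", X)
gamma_scale = resolve_gamma("scale", X)
```

Here the entries 0, 2, 4, 6 have variance 5, so “scale” gives 1 / (2 * 5) = 0.1, while “auto” depends only on the number of features and gives 1 / 2 = 0.5.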

Attributes

hyperparameter_ranges	{“C”: Real(0, 10), “kernel”: [“poly”, “rbf”, “sigmoid”], “gamma”: [“scale”, “auto”]}
model_family	ModelFamily.SVM
modifies_features	True
modifies_target	False
name	SVM Classifier
supported_problem_types	[ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance only works with linear kernels.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#

Feature importance only works with linear kernels.

If the kernel isn’t linear, we return a numpy array of zeros.

Returns: Feature importance of the fitted SVM classifier, or a numpy array of zeros if the kernel is not linear.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.SVMRegressor(C=1.0, kernel='rbf', gamma='auto', random_seed=0, **kwargs)[source]#

Support Vector Machine Regressor.

Parameters

C (float) – The regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty. Defaults to 1.0.
kernel ({"poly", "rbf", "sigmoid"}) – Specifies the kernel type to be used in the algorithm. Defaults to “rbf”.
gamma ({"scale", "auto"} or float) – Kernel coefficient for “rbf”, “poly” and “sigmoid”. If “scale”, uses 1 / (n_features * X.var()) as the value of gamma; if “auto”, uses 1 / n_features. Defaults to “auto”.
random_seed (int) – Seed for the random number generator. Defaults to 0.
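
The two string settings for gamma described above can be sketched in plain Python. This is only an illustration of the stated formulas, not the library's code: `resolve_gamma` is a hypothetical helper, X is assumed to be a row-major list of lists, and X.var() is taken to mean the population variance over all entries of X.

```python
from statistics import pvariance


def resolve_gamma(gamma, X):
    """Turn a gamma setting into a float (hypothetical helper, for illustration)."""
    n_features = len(X[0])
    if gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        # Assumption: X.var() is the population variance over every entry of X.
        flat = [value for row in X for value in row]
        return 1.0 / (n_features * pvariance(flat))
    # Any other value is treated as an explicit kernel coefficient.
    return float(gamma)


X = [[0.0, 2.0], [2.0, 4.0]]
gamma_auto = resolve_gamma("auto", X)
gamma_scale = resolve_gamma("scale", X)
gamma_explicit = resolve_gamma(0.1, X)
```

With “auto” the coefficient depends only on the number of columns, while “scale” also shrinks as the spread of the feature values grows.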

Attributes

hyperparameter_ranges	{ “C”: Real(0, 10), “kernel”: [“poly”, “rbf”, “sigmoid”], “gamma”: [“scale”, “auto”],}
model_family	ModelFamily.SVM
modifies_features	True
modifies_target	False
name	SVM Regressor
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance of fitted SVM regressor.
`fit`	Fits estimator to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict
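
The convention above can be illustrated with a toy class. `ToyComponent` is hypothetical and only mirrors the SVMRegressor signature; the real component may expose default_parameters differently (for example as a class-level property rather than a method).

```python
class ToyComponent:
    """Hypothetical stand-in for a component; not the EvalML implementation."""

    def __init__(self, C=1.0, kernel="rbf", gamma="auto"):
        self.parameters = {"C": C, "kernel": kernel, "gamma": gamma}

    @classmethod
    def default_parameters(cls):
        # The defaults are whatever a no-argument instance reports.
        return cls().parameters


defaults = ToyComponent.default_parameters()
customized = ToyComponent(C=5.0).parameters
```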

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#

Feature importance of fitted SVM regressor.

Only works with linear kernels. If the kernel isn’t linear, we return a numpy array of zeros.

Returns: The feature importance of the fitted SVM regressor, or an array of zeroes if the kernel is not linear.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)#

Fits estimator to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
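
To make the key format of the returned dictionary concrete, here is a minimal sketch. It assumes normally distributed residuals with a known standard deviation, which is an illustrative choice and not necessarily how the library derives its bounds; `interval_dict` and `residual_std` are hypothetical names.

```python
from statistics import NormalDist


def interval_dict(predictions, residual_std, coverage=(0.95,)):
    """Build {coverage}_lower / {coverage}_upper entries (illustrative only)."""
    intervals = {}
    for cov in coverage:
        # Two-sided interval: split the uncovered mass evenly between the tails.
        z = NormalDist().inv_cdf(0.5 + cov / 2)
        intervals[f"{cov}_lower"] = [p - z * residual_std for p in predictions]
        intervals[f"{cov}_upper"] = [p + z * residual_std for p in predictions]
    return intervals


intervals = interval_dict([10.0, 12.0], residual_std=1.0, coverage=[0.8, 0.95])
```

Each coverage value contributes two keys, so a list of two coverages yields four series of bounds.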

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series#

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict method or a component_obj that implements predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.TargetEncoder(cols=None, smoothing=1, handle_unknown='value', handle_missing='value', random_seed=0, **kwargs)[source]#

A transformer that encodes categorical features into target encodings.

Parameters

cols (list) – Columns to encode. If None, all string columns will be encoded, otherwise only the columns provided will be encoded. Defaults to None.
smoothing (float) – The smoothing factor to apply. The larger this value is, the more influence the expected target value has on the resulting target encodings. Must be strictly larger than 0. Defaults to 1.0.
handle_unknown (string) – Determines how to handle unknown categories for a feature encountered. Options are ‘value’, ‘error’, and ‘return_nan’. Defaults to ‘value’, which replaces with the target mean.
handle_missing (string) – Determines how to handle missing values encountered during fit or transform. Options are ‘value’, ‘error’, and ‘return_nan’. Defaults to ‘value’, which replaces with the target mean.
random_seed (int) – Seed for the random number generator. Defaults to 0.
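
The effect of smoothing and of the ‘value’ option can be sketched with one common blending rule (additive, or m-estimate, smoothing). This is an assumption made for illustration: the encoder wrapped by this component may use a different blending function, and `fit_target_encoding` and `encode` are hypothetical helpers.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=1.0):
    """Learn one smoothed target mean per category (illustrative formula)."""
    prior = sum(targets) / len(targets)  # overall target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Blend each category mean with the prior; larger smoothing favors the prior.
    encoding = {
        category: (sums[category] + smoothing * prior) / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, prior


def encode(categories, encoding, prior):
    """Unknown categories fall back to the target mean, like the 'value' option."""
    return [encoding.get(category, prior) for category in categories]


cats = ["a", "a", "b", "b"]
y = [1, 1, 0, 1]
encoding, prior = fit_target_encoding(cats, y, smoothing=1.0)
heavy, _ = fit_target_encoding(cats, y, smoothing=100.0)
encoded = encode(["a", "c"], encoding, prior)
```

Raising the smoothing value pulls every category's encoding toward the overall target mean, which matches the behavior described for the smoothing parameter.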

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	False
name	Target Encoder
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits the target encoder.
`fit_transform`	Fit and transform data using the target encoder.
`get_feature_names`	Return feature names for the input features after fitting.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transform data using the fitted target encoder.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y)[source]#

Fits the target encoder.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y)[source]#

Fit and transform data using the target encoder.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Transformed data.

Return type

pd.DataFrame

get_feature_names(self)[source]#

Return feature names for the input features after fitting.

Returns: The feature names after encoding.
Return type: np.array

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transform data using the fitted target encoder.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Transformed data.

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.TimeSeriesBinaryClassificationPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline base class for time series binary classification problems.

Parameters

component_graph (list or dict) – List of components in order. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”]
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary {} implies using all default values for component parameters. Pipeline-level parameters such as time_index, gap, and max_delay must be specified with the “pipeline” key. For example: Pipeline(parameters={“pipeline”: {“time_index”: “Date”, “max_delay”: 4, “gap”: 2}}).
random_seed (int) – Seed for the random number generator. Defaults to 0.

Example

>>> pipeline = TimeSeriesBinaryClassificationPipeline(component_graph=["Simple Imputer", "Logistic Regression Classifier"],
...                                                   parameters={"Logistic Regression Classifier": {"penalty": "elasticnet",
...                                                                                                  "solver": "liblinear"},
...                                                               "pipeline": {"gap": 1, "max_delay": 1, "forecast_horizon": 1, "time_index": "date"}},
...                                                   custom_name="My TimeSeriesBinary Pipeline")
...
>>> assert pipeline.custom_name == "My TimeSeriesBinary Pipeline"
>>> assert pipeline.component_graph.component_dict.keys() == {'Simple Imputer', 'Logistic Regression Classifier'}
...
>>> assert pipeline.parameters == {
...     'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None},
...     'Logistic Regression Classifier': {'penalty': 'elasticnet',
...                                         'C': 1.0,
...                                         'n_jobs': -1,
...                                         'multi_class': 'auto',
...                                         'solver': 'liblinear'},
...     'pipeline': {'gap': 1, 'max_delay': 1, 'forecast_horizon': 1, 'time_index': "date"}}

Attributes

problem_type	None

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`classes_`	Gets the class names for the pipeline. Will return None before pipeline is fit.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`dates_needed_for_prediction`	Return dates needed to forecast the given date in the future.
`dates_needed_for_prediction_range`	Return dates needed to forecast the given date range in the future.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Fit a time series classification model.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`optimize_threshold`	Optimize the pipeline threshold given the objective to use. Only used for binary problems with objectives whose thresholds can be tuned.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Predict on future data where target is not known.
`predict_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`predict_proba`	Predict on future data where the target is unknown.
`predict_proba_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on current and additional objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`threshold`	Threshold used to make a prediction. Defaults to None.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

property classes_(self)#: Gets the class names for the pipeline. Will return None before pipeline is fit.

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

dates_needed_for_prediction(self, date)#

Return dates needed to forecast the given date in the future.

Parameters: date (pd.Timestamp) – Date to forecast in the future.
Returns: Range of dates needed to forecast the given date.
Return type: dates_needed (tuple(pd.Timestamp))

dates_needed_for_prediction_range(self, start_date, end_date)#

Return dates needed to forecast the given date range in the future.

Parameters

start_date (pd.Timestamp) – Start date of range to forecast in the future.
end_date (pd.Timestamp) – End date of range to forecast in the future.

Returns

Range of dates needed to forecast the given date range.

Return type

dates_needed (tuple(pd.Timestamp))

Raises

ValueError – If start_date doesn’t come before end_date.

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)#

Fit a time series classification model.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training labels of length [n_samples]

Returns

self

Raises

ValueError – If the number of unique classes in y is not appropriate for the type of pipeline.

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: The component with the given name.
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series
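
The reverse-order rule can be sketched with toy target transformers. The classes below are hypothetical stand-ins rather than EvalML components, and plain lists are used in place of pd.Series; components without an inverse_transform method are assumed to be skipped.

```python
import math


class ShiftTarget:
    """Hypothetical target transformer that adds a constant offset."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, y):
        return [value + self.offset for value in y]

    def inverse_transform(self, y):
        return [value - self.offset for value in y]


class LogTarget:
    """Hypothetical target transformer that takes the natural log."""

    def transform(self, y):
        return [math.log(value) for value in y]

    def inverse_transform(self, y):
        return [math.exp(value) for value in y]


class PassThrough:
    """Hypothetical component without inverse_transform; it is skipped."""

    def transform(self, y):
        return y


def inverse_transform_all(components, y):
    # Undo the most recently applied transformation first.
    for component in reversed(components):
        if hasattr(component, "inverse_transform"):
            y = component.inverse_transform(y)
    return y


components = [ShiftTarget(1.0), PassThrough(), LogTarget()]
original = [1.0, 3.0]
transformed = original
for component in components:
    transformed = component.transform(transformed)
recovered = inverse_transform_all(components, transformed)
```

Applying the inverses in forward order instead would subtract the offset before exponentiating and would not recover the original values.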

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – load filepath or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

optimize_threshold(self, X, y, y_pred_proba, objective)#

Optimize the pipeline threshold given the objective to use. Only used for binary problems with objectives whose thresholds can be tuned.

Parameters

X (pd.DataFrame) – Input features.
y (pd.Series) – Input target values.
y_pred_proba (pd.Series) – The predicted probabilities of the target outputted by the pipeline.
objective (ObjectiveBase) – The objective to threshold with. Must have a tunable threshold.

Raises

ValueError – If objective is not optimizable.
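
As a rough picture of what tuning a threshold means, the sketch below scans candidate thresholds and keeps the one with the best F1 score. The grid search and the choice of F1 are assumptions made for illustration; the pipeline delegates the search to the objective and may use a different optimizer. All function names here are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, written out with plain counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def apply_threshold(y_pred_proba, threshold):
    """Predict the positive class when its probability exceeds the threshold."""
    return [proba > threshold for proba in y_pred_proba]


def grid_search_threshold(y_true, y_pred_proba, steps=100):
    """Return the first threshold on an even grid that maximizes F1."""
    best_threshold, best_score = 0.5, -1.0
    for i in range(steps + 1):
        threshold = i / steps
        score = f1_score(y_true, apply_threshold(y_pred_proba, threshold))
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


y_true = [0, 0, 1, 1]
y_pred_proba = [0.1, 0.4, 0.35, 0.8]
threshold, score = grid_search_threshold(y_true, y_pred_proba)
```

Once a threshold has been chosen, predictions come from comparing the positive-class probability against it instead of against a fixed 0.5.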

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)#

Predict on future data where target is not known.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame or np.ndarray or None) – Training data.
y_train (pd.Series or None) – Training labels.

Raises

ValueError – If X_train and/or y_train are None or if final component is not an Estimator.

Returns

Predictions.

predict_in_sample(self, X, y, X_train, y_train, objective=None)[source]#

Predict on future data where the target is known, e.g. cross validation.

Parameters

X (pd.DataFrame) – Future data of shape [n_samples, n_features].
y (pd.Series) – Future target of shape [n_samples].
X_train (pd.DataFrame) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series) – Targets used to train the pipeline of shape [n_samples_train].
objective (ObjectiveBase, str) – Objective used to threshold predicted probabilities, optional. Defaults to None.

Returns

Estimated labels.

Return type

pd.Series

Raises

ValueError – If objective is not defined for time-series binary classification problems.

predict_proba(self, X, X_train=None, y_train=None)#

Predict on future data where the target is unknown.

Parameters

X (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Estimated probabilities.

Return type

pd.Series

Raises

ValueError – If final component is not an Estimator.

predict_proba_in_sample(self, X_holdout, y_holdout, X_train, y_train)#

Predict on future data where the target is known, e.g. cross validation.

Parameters

X_holdout (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
y_holdout (pd.Series, np.ndarray) – Future target of shape [n_samples].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Estimated probabilities.

Return type

pd.Series

Raises

ValueError – If the final component is not an Estimator.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)#

Evaluate model performance on current and additional objectives.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – True labels of length [n_samples].
objectives (list) – Non-empty list of objectives to score on.
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.
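
The format of the example string can be reproduced with a few lines of Python. This sketch assumes the last component in the list is the estimator and that all earlier components are preprocessing steps; `pipeline_summary` is a hypothetical helper, not the property's implementation.

```python
def pipeline_summary(component_names):
    """Format names as '<estimator> w/ <step> + <step>' (illustrative only)."""
    *preprocessing, estimator = component_names
    if not preprocessing:
        return estimator
    return f"{estimator} w/ " + " + ".join(preprocessing)


summary = pipeline_summary(
    ["Simple Imputer", "One Hot Encoder", "Logistic Regression Classifier"]
)
```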

property threshold(self)#: Threshold used to make a prediction. Defaults to None.

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None, calculating_residuals=False)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series) – Targets corresponding to the pipeline targets.
X_train (pd.DataFrame) – Training data used to generate features from past observations.
y_train (pd.Series) – Training targets used to generate features from past observations.
calculating_residuals (bool) – Whether we’re calling predict_in_sample to calculate the residuals. This means the X and y arguments are not future data, but actually the train data.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.TimeSeriesClassificationPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline base class for time series classification problems.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”]
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary {} implies using all default values for component parameters. Pipeline-level parameters such as time_index, gap, and max_delay must be specified with the “pipeline” key. For example: Pipeline(parameters={“pipeline”: {“time_index”: “Date”, “max_delay”: 4, “gap”: 2}}).
random_seed (int) – Seed for the random number generator. Defaults to 0.
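
The renaming rule for duplicate components in a list can be sketched as follows. This is inferred from the example above (a repeated name gets its zero-based list index appended) and is not the library's code; `unique_component_names` is a hypothetical helper.

```python
def unique_component_names(names):
    """Suffix repeated names with their position in the list (illustrative)."""
    seen = set()
    result = []
    for index, name in enumerate(names):
        if name in seen:
            # A repeat is disambiguated with its index in the original list.
            result.append(f"{name}_{index}")
        else:
            seen.add(name)
            result.append(name)
    return result


names = unique_component_names(
    ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
)
```

The generated names are the keys to use in the parameters dictionary, so the second imputer in this list would be configured under "Imputer_2".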

Attributes

problem_type	None

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`classes_`	Gets the class names for the pipeline. Will return None before pipeline is fit.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`dates_needed_for_prediction`	Return dates needed to forecast the given date in the future.
`dates_needed_for_prediction_range`	Return dates needed to forecast the given date range in the future.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Fit a time series classification model.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Predict on future data where target is not known.
`predict_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`predict_proba`	Predict on future data where the target is unknown.
`predict_proba_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on current and additional objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

property classes_(self)#: Gets the class names for the pipeline. Will return None before pipeline is fit.

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

dates_needed_for_prediction(self, date)#

Return dates needed to forecast the given date in the future.

Parameters: date (pd.Timestamp) – Date to forecast in the future.
Returns: Range of dates needed to forecast the given date.
Return type: dates_needed (tuple(pd.Timestamp))

dates_needed_for_prediction_range(self, start_date, end_date)#

Return dates needed to forecast the given date range in the future.

Parameters

start_date (pd.Timestamp) – Start date of range to forecast in the future.
end_date (pd.Timestamp) – End date of range to forecast in the future.

Returns

Range of dates needed to forecast the given date range.

Return type

dates_needed (tuple(pd.Timestamp))

Raises

ValueError – If start_date doesn’t come before end_date

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)[source]#

Fit a time series classification model.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training labels of length [n_samples]

Returns

self

Raises

ValueError – If the number of unique classes in y is not appropriate for the type of pipeline.

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: Component to return
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – load filepath or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)#

Predict on future data where target is not known.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame or np.ndarray or None) – Training data.
y_train (pd.Series or None) – Training labels.

Raises

ValueError – If X_train and/or y_train are None or if final component is not an Estimator.

Returns

Predictions.

predict_in_sample(self, X, y, X_train, y_train, objective=None)[source]#

Predict on future data where the target is known, e.g. cross validation.

Note: we cast y as ints first to address boolean values that may be returned from calculating predictions which we would not be able to otherwise transform if we originally had integer targets.

Parameters

X (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
y (pd.Series, np.ndarray) – Future target of shape [n_samples].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].
objective (ObjectiveBase, str, None) – Objective used to threshold predicted probabilities, optional.

Returns

Estimated labels.

Return type

pd.Series

Raises

ValueError – If final component is not an Estimator.

predict_proba(self, X, X_train=None, y_train=None)[source]#

Predict on future data where the target is unknown.

Parameters

X (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Estimated probabilities.

Return type

pd.Series

Raises

ValueError – If final component is not an Estimator.

predict_proba_in_sample(self, X_holdout, y_holdout, X_train, y_train)[source]#

Predict on future data where the target is known, e.g. cross validation.

Parameters

X_holdout (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
y_holdout (pd.Series, np.ndarray) – Future target of shape [n_samples].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Estimated probabilities.

Return type

pd.Series

Raises

ValueError – If the final component is not an Estimator.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)[source]#

Evaluate model performance on current and additional objectives.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – True labels of length [n_samples].
objectives (list) – Non-empty list of objectives to score on.
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.
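
The format of the example string above can be reproduced with a short sketch. This is only an illustration of the "estimator w/ transformer + transformer" layout shown in the example; `summarize` is a hypothetical helper, and it assumes the last entry in the list is the estimator.

```python
def summarize(component_names):
    """Build a summary string from an ordered list of component names.

    Assumption: the final name is the estimator and every earlier name
    is a transformer.
    """
    *transformers, estimator = component_names
    if not transformers:
        return estimator
    return f"{estimator} w/ {' + '.join(transformers)}"


print(summarize(["Simple Imputer", "One Hot Encoder", "Logistic Regression Classifier"]))
```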

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None, calculating_residuals=False)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series) – Targets corresponding to the pipeline targets.
X_train (pd.DataFrame) – Training data used to generate features from past observations.
y_train (pd.Series) – Training targets used to generate features from past observations.
calculating_residuals (bool) – Whether we’re calling predict_in_sample to calculate the residuals. This means the X and y arguments are not future data, but actually the train data.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.TimeSeriesFeaturizer(time_index=None, max_delay=2, gap=0, forecast_horizon=1, conf_level=0.05, rolling_window_size=0.25, delay_features=True, delay_target=True, random_seed=0, **kwargs)[source]#

Transformer that delays input features and target variable for time series problems.

This component uses an algorithm based on the autocorrelation values of the target variable to determine which lags to select from the set of all possible lags.

The algorithm is based on the idea that the local maxima of the autocorrelation function indicate the lags that have the most impact on the present time.

The algorithm computes the autocorrelation values and finds the local maxima, called “peaks”, that are significant at the given conf_level. Since lags in the range [0, 10] tend to be predictive but not local maxima, the union of the peaks is taken with the significant lags in the range [0, 10]. At the end, only selected lags in the range [0, max_delay] are used.

Parametrizing the algorithm by conf_level lets the AutoMLAlgorithm tune the set of lags chosen, which raises the chance of finding a good set of lags.

Using a conf_level value of 1 selects all possible lags.
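
The lag-selection steps described above can be sketched in plain Python. This is a simplified illustration, not the component's implementation: the helper names `autocorrelation` and `select_lags` are hypothetical, and the significance bound uses a large-sample normal approximation (z / sqrt(n)) as an assumed stand-in for the confidence interval the library computes.

```python
from math import sqrt
from statistics import NormalDist, fmean


def autocorrelation(y, max_lag):
    """Sample autocorrelation of y for lags 0..max_lag (assumes y is not constant)."""
    n = len(y)
    mean = fmean(y)
    centered = [v - mean for v in y]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def select_lags(y, max_delay, conf_level=0.05):
    """Pick lags in [1, max_delay] from significant peaks plus significant short lags."""
    if conf_level == 1:
        # Documented special case: a conf_level of 1 keeps every lag.
        return list(range(1, max_delay + 1))

    n = len(y)
    max_lag = min(max(max_delay, 10) + 1, n - 1)
    acf = autocorrelation(y, max_lag)

    # Assumption: two-sided normal bound; a smaller conf_level gives a stricter bound.
    bound = NormalDist().inv_cdf(1 - conf_level / 2) / sqrt(n)
    significant = {k for k, r in enumerate(acf) if abs(r) >= bound}

    # Local maxima of the autocorrelation function ("peaks").
    peaks = {
        k
        for k in range(1, len(acf) - 1)
        if acf[k] > acf[k - 1] and acf[k] > acf[k + 1]
    }

    # Union of the significant peaks with the significant lags in [0, 10].
    chosen = (peaks & significant) | {k for k in significant if k <= 10}
    chosen = {k for k in chosen if 1 <= k <= max_delay}
    chosen.add(1)  # a delay of 1 is always computed
    return sorted(chosen)


# A series with period 4: the even lags are strongly (anti)correlated.
series = [0, 1, 0, -1] * 10
print(select_lags(series, max_delay=8))
print(select_lags(series, max_delay=8, conf_level=1))
```

In this sketch a smaller conf_level raises the bound and so keeps fewer lags, while a value of 1 bypasses the test entirely, matching the behavior described above.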

Parameters

time_index (str) – Name of the column containing the datetime information used to order the data. Ignored.
max_delay (int) – Maximum number of time units to delay each feature. Defaults to 2.
forecast_horizon (int) – The number of time periods the pipeline is expected to forecast. Defaults to 1.
conf_level (float) – Float in range (0, 1] that determines the confidence interval size used to select which lags to compute from the set of [1, max_delay]. A delay of 1 will always be computed. If 1, selects all possible lags in the set of [1, max_delay], inclusive. Defaults to 0.05.
rolling_window_size (float) – Float in range (0, 1] that determines the size of the window used for rolling features. Size is computed as rolling_window_size * max_delay. Defaults to 0.25.
delay_features (bool) – Whether to delay the input features. Defaults to True.
delay_target (bool) – Whether to delay the target. Defaults to True.
gap (int) – The number of time units between when the features are collected and when the target is collected. For example, if you are predicting the next time step’s target, gap=1. This is only needed because when gap=0, we need to be sure to start the lagging of the target variable at 1. Defaults to 0.
random_seed (int) – Seed for the random number generator. This transformer performs the same regardless of the random seed provided. Defaults to 0.

Attributes

df_colname_prefix	{}_delay_{}
hyperparameter_ranges	{“conf_level”: Real(0.001, 1.0), “rolling_window_size”: Real(0.001, 1.0)}
modifies_features	True
modifies_target	False
name	Time Series Featurizer
needs_fitting	True
target_colname_prefix	target_delay_{}
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits the TimeSeriesFeaturizer.
`fit_transform`	Fit the component and transform the input data.
`load`	Loads component at file path.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Computes the delayed values and rolling means for X and y.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the TimeSeriesFeaturizer.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

ValueError – If self.time_index is None.

fit_transform(self, X, y=None)[source]#

Fit the component and transform the input data.

Parameters

X (pd.DataFrame) – Data to transform.
y (pd.Series, or None) – Target.

Returns

Transformed X.

Return type

pd.DataFrame

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Computes the delayed values and rolling means for X and y.

The chosen delays are determined by the autocorrelation function of the target variable. See the class docstring for more information on how they are chosen. If y is None, all possible lags are chosen.

If y is not None, it will also compute the delayed values for the target variable.

The rolling means for all numeric features in X and y, if y is numeric, are also returned.

Parameters

X (pd.DataFrame or None) – Data to transform. None is expected when only the target variable is being used.
y (pd.Series, or None) – Target.

Returns

Transformed X. No original features are returned.

Return type

pd.DataFrame
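
As a rough illustration of the delayed columns described above, the sketch below shifts each feature and the target by a list of lags and names the results with the `{}_delay_{}` and `target_delay_{}` prefixes listed under Attributes. It is a plain-Python stand-in, not the component's code: `delay` and `delayed_features` are hypothetical helpers, the lags are passed in by hand rather than chosen from the autocorrelation function, rolling means are omitted, and any offset from gap or forecast_horizon is ignored.

```python
def delay(values, lag):
    """Shift a sequence forward by `lag` steps, padding the start with None."""
    values = list(values)
    return ([None] * lag + values)[: len(values)]


def delayed_features(columns, target, lags):
    """Return only the delayed columns; the original features are not included."""
    out = {}
    for name, values in columns.items():
        for lag in lags:
            out["{}_delay_{}".format(name, lag)] = delay(values, lag)
    if target is not None:
        # The target is delayed only when it is provided.
        for lag in lags:
            out["target_delay_{}".format(lag)] = delay(target, lag)
    return out


features = {"temp": [1, 2, 3, 4]}
target = [10, 20, 30, 40]
for column, values in delayed_features(features, target, lags=[1, 2]).items():
    print(column, values)
```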

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.TimeSeriesImputer(categorical_impute_strategy='forwards_fill', numeric_impute_strategy='interpolate', target_impute_strategy='forwards_fill', random_seed=0, **kwargs)[source]#

Imputes missing data according to a specified timeseries-specific imputation strategy.

This Transformer should be used after the TimeSeriesRegularizer in order to impute the missing values that were added to X and y (if passed).

Parameters

categorical_impute_strategy (string) – Impute strategy to use for string, object, boolean, categorical dtypes. Valid values include “backwards_fill” and “forwards_fill”. Defaults to “forwards_fill”.
numeric_impute_strategy (string) – Impute strategy to use for numeric columns. Valid values include “backwards_fill”, “forwards_fill”, and “interpolate”. Defaults to “interpolate”.
target_impute_strategy (string) – Impute strategy to use for the target column. Valid values include “backwards_fill”, “forwards_fill”, and “interpolate”. Defaults to “forwards_fill”.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Raises

ValueError – If categorical_impute_strategy, numeric_impute_strategy, or target_impute_strategy is not one of the valid values.

Attributes

hyperparameter_ranges	{ “categorical_impute_strategy”: [“backwards_fill”, “forwards_fill”], “numeric_impute_strategy”: [“backwards_fill”, “forwards_fill”, “interpolate”], “target_impute_strategy”: [“backwards_fill”, “forwards_fill”, “interpolate”],}
modifies_features	True
modifies_target	True
name	Time Series Imputer
training_only	True

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits imputer to data.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms data X by imputing missing values using specified timeseries-specific strategies. 'None' values are converted to np.nan before imputation and are treated as the same.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits imputer to data.

‘None’ values are converted to np.nan before imputation and are treated as the same. If a value is missing at the beginning or end of a column, that value will be imputed using backwards fill or forwards fill as necessary, respectively.

Parameters

X (pd.DataFrame, np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self
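
The three strategies and the edge handling described above can be sketched on plain Python lists. This is an illustrative approximation, not the component's implementation: `impute`, `forwards_fill`, `backwards_fill`, and `interpolate` are hypothetical helpers, and the interpolation is assumed to be linear between the nearest observed neighbors.

```python
import math


def _is_missing(value):
    """Treat None and NaN as the same kind of missing value."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def forwards_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def backwards_fill(values):
    """Carry the next observed value backward; trailing gaps stay None."""
    return forwards_fill(values[::-1])[::-1]


def interpolate(values):
    """Linearly interpolate gaps that lie between two observed values."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


STRATEGIES = {
    "forwards_fill": forwards_fill,
    "backwards_fill": backwards_fill,
    "interpolate": interpolate,
}


def impute(values, strategy):
    if strategy not in STRATEGIES:
        raise ValueError(f"{strategy} is not a valid imputation strategy")
    cleaned = [None if _is_missing(v) else v for v in values]
    filled = STRATEGIES[strategy](cleaned)
    # Edge handling: backfill a missing start, then forward-fill a missing end.
    return forwards_fill(backwards_fill(filled))


column = [None, 1.0, float("nan"), None, 4.0, None]
for name in STRATEGIES:
    print(name, impute(column, name))
```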

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms data X by imputing missing values using specified timeseries-specific strategies. ‘None’ values are converted to np.nan before imputation and are treated as the same.

Parameters

X (pd.DataFrame) – Data to transform.
y (pd.Series, optional) – Optionally, target data to transform.

Returns

Transformed X and y

Return type

pd.DataFrame

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.TimeSeriesMulticlassClassificationPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline base class for time series multiclass classification problems.

Parameters

component_graph (list or dict) – List of components in order. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”]
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary {} implies using all default values for component parameters. Pipeline-level parameters such as time_index, gap, and max_delay must be specified with the “pipeline” key. For example: Pipeline(parameters={“pipeline”: {“time_index”: “Date”, “max_delay”: 4, “gap”: 2}}).
random_seed (int) – Seed for the random number generator. Defaults to 0.
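
The duplicate-naming rule described for component_graph can be sketched as follows. This is a minimal illustration of the documented example only; `unique_component_names` is a hypothetical helper, and it assumes the suffix is the component's zero-based position in the list, which is what the "Imputer_2" example implies.

```python
def unique_component_names(names):
    """Append the list index to any name that has already appeared."""
    seen = set()
    result = []
    for index, name in enumerate(names):
        result.append(f"{name}_{index}" if name in seen else name)
        seen.add(name)
    return result


graph = ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
print(unique_component_names(graph))
```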

Example

>>> pipeline = TimeSeriesMulticlassClassificationPipeline(component_graph=["Simple Imputer", "Logistic Regression Classifier"],
...                                                       parameters={"Logistic Regression Classifier": {"penalty": "elasticnet",
...                                                                                                      "solver": "liblinear"},
...                                                                   "pipeline": {"gap": 1, "max_delay": 1, "forecast_horizon": 1, "time_index": "date"}},
...                                                       custom_name="My TimeSeriesMulticlass Pipeline")
>>> assert pipeline.custom_name == "My TimeSeriesMulticlass Pipeline"
>>> assert pipeline.component_graph.component_dict.keys() == {'Simple Imputer', 'Logistic Regression Classifier'}
>>> assert pipeline.parameters == {
...  'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None},
...  'Logistic Regression Classifier': {'penalty': 'elasticnet',
...                                     'C': 1.0,
...                                     'n_jobs': -1,
...                                     'multi_class': 'auto',
...                                     'solver': 'liblinear'},
...     'pipeline': {'gap': 1, 'max_delay': 1, 'forecast_horizon': 1, 'time_index': "date"}}

Attributes

problem_type	ProblemTypes.TIME_SERIES_MULTICLASS

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`classes_`	Gets the class names for the pipeline. Will return None before pipeline is fit.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`dates_needed_for_prediction`	Return dates needed to forecast the given date in the future.
`dates_needed_for_prediction_range`	Return dates needed to forecast the given date range in the future.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Fit a time series classification model.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Predict on future data where target is not known.
`predict_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`predict_proba`	Predict on future data where the target is unknown.
`predict_proba_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on current and additional objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

property classes_(self)#: Gets the class names for the pipeline. Will return None before pipeline is fit.

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

dates_needed_for_prediction(self, date)#

Return dates needed to forecast the given date in the future.

Parameters: date (pd.Timestamp) – Date to forecast in the future.
Returns: Range of dates needed to forecast the given date.
Return type: dates_needed (tuple(pd.Timestamp))

dates_needed_for_prediction_range(self, start_date, end_date)#

Return dates needed to forecast the given date range in the future.

Parameters

start_date (pd.Timestamp) – Start date of range to forecast in the future.
end_date (pd.Timestamp) – End date of range to forecast in the future.

Returns

Range of dates needed to forecast the given date range.

Return type

dates_needed (tuple(pd.Timestamp))

Raises

ValueError – If start_date doesn’t come before end_date

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)#

Fit a time series classification model.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training labels of length [n_samples]

Returns

self

Raises

ValueError – If the number of unique classes in y is not appropriate for the type of pipeline.

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: Component to return
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – Path of the file to load, or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)#

Predict on future data where target is not known.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame or np.ndarray or None) – Training data.
y_train (pd.Series or None) – Training labels.

Raises

ValueError – If X_train and/or y_train are None or if final component is not an Estimator.

Returns

Predictions.

predict_in_sample(self, X, y, X_train, y_train, objective=None)#

Predict on future data where the target is known, e.g. cross validation.

Note: we cast y as ints first to address boolean values that may be returned from calculating predictions which we would not be able to otherwise transform if we originally had integer targets.

Parameters

X (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
y (pd.Series, np.ndarray) – Future target of shape [n_samples].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].
objective (ObjectiveBase, str, None) – Objective used to threshold predicted probabilities, optional.

Returns

Estimated labels.

Return type

pd.Series

Raises

ValueError – If final component is not an Estimator.

predict_proba(self, X, X_train=None, y_train=None)#

Predict on future data where the target is unknown.

Parameters

X (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Estimated probabilities.

Return type

pd.Series

Raises

ValueError – If final component is not an Estimator.

predict_proba_in_sample(self, X_holdout, y_holdout, X_train, y_train)#

Predict on future data where the target is known, e.g. cross validation.

Parameters

X_holdout (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features].
y_holdout (pd.Series, np.ndarray) – Future target of shape [n_samples].
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Estimated probabilities.

Return type

pd.Series

Raises

ValueError – If the final component is not an Estimator.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)#

Evaluate model performance on current and additional objectives.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – True labels of length [n_samples].
objectives (list) – Non-empty list of objectives to score on.
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None, calculating_residuals=False)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series) – Targets corresponding to the pipeline targets.
X_train (pd.DataFrame) – Training data used to generate features from past observations.
y_train (pd.Series) – Training targets used to generate features from past observations.
calculating_residuals (bool) – Whether we’re calling predict_in_sample to calculate the residuals. This means the X and y arguments are not future data, but actually the train data.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.TimeSeriesRegressionPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)[source]#

Pipeline base class for time series regression problems.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components. Accepts strings or ComponentBase subclasses in the list. Note that when duplicate components are specified in a list, the duplicate component names will be modified with the component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”]
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary {} implies using all default values for component parameters. Pipeline-level parameters such as time_index, gap, and max_delay must be specified with the “pipeline” key. For example: Pipeline(parameters={“pipeline”: {“time_index”: “Date”, “max_delay”: 4, “gap”: 2}}).
random_seed (int) – Seed for the random number generator. Defaults to 0.
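
The duplicate-naming rule described for component_graph can be sketched in plain Python. This is an illustration inferred from the documented "Imputer_2" example, not EvalML's implementation: `unique_component_names` is a hypothetical helper, and the suffix is assumed to be the component's zero-based position in the list.

```python
def unique_component_names(components):
    """Return component names, suffixing repeats with their list index.

    Hypothetical helper: the zero-based index suffix is an assumption
    inferred from the documented "Imputer_2" example.
    """
    seen = set()
    names = []
    for index, name in enumerate(components):
        # Only a name that already appeared earlier receives a suffix.
        unique = f"{name}_{index}" if name in seen else name
        seen.add(name)
        names.append(unique)
    return names


names = unique_component_names(
    ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
)
print(names)
```

The suffixed names are the keys to use in the parameters dictionary when two components of the same type need different settings.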

Example

>>> pipeline = TimeSeriesRegressionPipeline(component_graph=["Simple Imputer", "Linear Regressor"],
...                                         parameters={"Simple Imputer": {"impute_strategy": "mean"},
...                                                     "pipeline": {"gap": 1, "max_delay": 1, "forecast_horizon": 1, "time_index": "date"}},
...                                         custom_name="My TimeSeriesRegression Pipeline")
...
>>> assert pipeline.custom_name == "My TimeSeriesRegression Pipeline"
>>> assert pipeline.component_graph.component_dict.keys() == {'Simple Imputer', 'Linear Regressor'}

The pipeline parameters will be chosen from the default parameters for every component, unless specific parameters were passed in as they were above.

>>> assert pipeline.parameters == {
...     'Simple Imputer': {'impute_strategy': 'mean', 'fill_value': None},
...     'Linear Regressor': {'fit_intercept': True, 'n_jobs': -1},
...     'pipeline': {'gap': 1, 'max_delay': 1, 'forecast_horizon': 1, 'time_index': "date"}}

Attributes

NO_PREDS_PI_ESTIMATORS	ProblemTypes.TIME_SERIES_REGRESSION
problem_type	None

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`dates_needed_for_prediction`	Return dates needed to forecast the given date in the future.
`dates_needed_for_prediction_range`	Return dates needed to forecast the given date range in the future.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Fit a time series pipeline.
`fit_transform`	Fit and transform all components in the component graph, if all components are Transformers.
`get_component`	Returns component by name.
`get_forecast_period`	Generates all possible forecasting time points based on latest data point in X.
`get_forecast_predictions`	Generates all possible forecasting predictions based on last period of X.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python's __new__ method.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Predict on future data where target is not known.
`predict_in_sample`	Predict on future data where the target is known, e.g. cross validation.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on current and additional objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)#

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

clone(self)#

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives)#: Create objective instances from a list of strings or objective classes.

property custom_name(self)#: Custom name of the pipeline.

dates_needed_for_prediction(self, date)#

Return dates needed to forecast the given date in the future.

Parameters: date (pd.Timestamp) – Date to forecast in the future.
Returns: Range of dates needed to forecast the given date.
Return type: dates_needed (tuple(pd.Timestamp))

dates_needed_for_prediction_range(self, start_date, end_date)#

Return dates needed to forecast the given date range in the future.

Parameters

start_date (pd.Timestamp) – Start date of range to forecast in the future.
end_date (pd.Timestamp) – End date of range to forecast in the future.

Returns

Range of dates needed to forecast the given date range.

Return type

dates_needed (tuple(pd.Timestamp))

Raises

ValueError – If start_date doesn’t come before end_date

describe(self, return_dict=False)#

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)#

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)[source]#

Fit a time series pipeline.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].
y (pd.Series, np.ndarray) – The target training data of length [n_samples].

Returns

self

Raises

ValueError – If the target is not numeric.

fit_transform(self, X, y)#

Fit and transform all components in the component graph, if all components are Transformers.

Parameters

X (pd.DataFrame) – Input features of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples].

Returns

Transformed output.

Return type

pd.DataFrame

Raises

ValueError – If final component is an Estimator.

get_component(self, name)#

Returns component by name.

Parameters: name (str) – Name of component.
Returns: Component to return
Return type: Component

get_forecast_period(self, X)[source]#

Generates all possible forecasting time points based on latest data point in X.

Parameters: X (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
Raises: ValueError – If pipeline is not trained.
Returns: Datetime periods from gap to forecast_horizon + gap.
Return type: pd.Series

Example

>>> X = pd.DataFrame({'date': pd.date_range(start='1-1-2022', periods=10, freq='D'), 'feature': range(10, 20)})
>>> y = pd.Series(range(0, 10), name='target')
>>> gap = 1
>>> forecast_horizon = 2
>>> pipeline = TimeSeriesRegressionPipeline(component_graph=["Linear Regressor"],
...                                         parameters={"pipeline": {"gap": gap, "max_delay": 1, "forecast_horizon": forecast_horizon, "time_index": "date"}},
...                                         )
>>> pipeline.fit(X, y)
pipeline = TimeSeriesRegressionPipeline(component_graph={'Linear Regressor': ['Linear Regressor', 'X', 'y']}, parameters={'Linear Regressor':{'fit_intercept': True, 'n_jobs': -1}, 'pipeline':{'gap': 1, 'max_delay': 1, 'forecast_horizon': 2, 'time_index': 'date'}}, random_seed=0)
>>> dates = pipeline.get_forecast_period(X)
>>> expected = pd.Series(pd.date_range(start='2022-01-11', periods=forecast_horizon, freq='D').shift(gap), name='date', index=[10, 11])
>>> assert dates.equals(expected)
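
The date arithmetic behind this example can be reproduced with the standard library alone. The sketch below is not EvalML code: `forecast_period` is a hypothetical helper, and a fixed daily step is assumed in place of the frequency the pipeline infers from time_index.

```python
from datetime import date, timedelta


def forecast_period(last_date, gap, forecast_horizon, step=timedelta(days=1)):
    """Return the forecast dates that follow last_date.

    The first forecast date lies gap + 1 steps after the last observed
    date and the final one lies gap + forecast_horizon steps after it.
    """
    return [last_date + (gap + i) * step for i in range(1, forecast_horizon + 1)]


# Same setup as the doctest: ten daily rows ending on 2022-01-10.
dates = forecast_period(date(2022, 1, 10), gap=1, forecast_horizon=2)
print(dates)
```

With gap=1 and forecast_horizon=2 the helper yields 2022-01-12 and 2022-01-13, the same two dates the `expected` series in the doctest contains.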

get_forecast_predictions(self, X, y)[source]#

Generates all possible forecasting predictions based on last period of X.

Parameters

X (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Predictions from gap periods out to forecast_horizon + gap periods.

get_hyperparameter_ranges(self, custom_hyperparameters)#

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

get_prediction_intervals(self, X, y=None, X_train=None, y_train=None, coverage=None)[source]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data.
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.
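
To make the `{coverage}_lower` / `{coverage}_upper` key format concrete, here is a minimal standard-library sketch. It is not how EvalML computes intervals: `prediction_intervals` is a hypothetical helper that assumes normally distributed residuals with a known standard deviation, and its default coverage of 0.95 is likewise an assumption.

```python
from statistics import NormalDist


def prediction_intervals(predictions, residual_std, coverage=(0.95,)):
    """Build symmetric normal-approximation intervals for each coverage.

    Hypothetical helper: keys follow the documented
    '{coverage}_lower' / '{coverage}_upper' naming.
    """
    intervals = {}
    for cov in coverage:
        if not 0 < cov < 1:
            raise ValueError("coverage values must lie strictly between 0 and 1")
        # Two-sided quantile of the standard normal for this coverage.
        z = NormalDist().inv_cdf(0.5 + cov / 2)
        intervals[f"{cov}_lower"] = [p - z * residual_std for p in predictions]
        intervals[f"{cov}_upper"] = [p + z * residual_std for p in predictions]
    return intervals


intervals = prediction_intervals([10.0, 12.0], residual_std=1.5, coverage=[0.8, 0.95])
print(sorted(intervals))
```

Each requested coverage contributes two entries to the dictionary, so a list of n coverage values produces 2n keys.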

graph(self, filepath=None)#

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)#

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

Returns: A dictionary representing the DAG structure.
Return type: dag_dict (dict)

graph_feature_importance(self, importance_threshold=0)#

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to zero.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)#

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDecomposer, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path: Union[str, io.BytesIO])#

Loads pipeline at file path.

Parameters: file_path (str|BytesIO) – Path of the file to load, or a BytesIO object.
Returns: PipelineBase object

property model_family(self)#: Returns model family of this pipeline.

property name(self)#: Name of the pipeline.

new(self, parameters, random_seed=0)#

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

property parameters(self)#

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)#

Predict on future data where target is not known.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame or np.ndarray or None) – Training data.
y_train (pd.Series or None) – Training labels.

Raises

ValueError – If X_train and/or y_train are None or if final component is not an Estimator.

Returns

Predictions.

predict_in_sample(self, X, y, X_train, y_train, objective=None, calculating_residuals=False)#

Predict on future data where the target is known, e.g. cross validation.

Parameters

X (pd.DataFrame or np.ndarray) – Future data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – Future target of shape [n_samples]
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features]
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train]
objective (ObjectiveBase, str, None) – Objective used to threshold predicted probabilities, optional.
calculating_residuals (bool) – Whether we’re calling predict_in_sample to calculate the residuals. This means the X and y arguments are not future data, but actually the train data.

Returns

Estimated labels.

Return type

pd.Series

Raises

ValueError – If final component is not an Estimator.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

score(self, X, y, objectives, X_train=None, y_train=None)[source]#

Evaluate model performance on current and additional objectives.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – True labels of length [n_samples].
objectives (list) – Non-empty list of objectives to score on.
X_train (pd.DataFrame, np.ndarray) – Data the pipeline was trained on of shape [n_samples_train, n_features].
y_train (pd.Series, np.ndarray) – Targets used to train the pipeline of shape [n_samples_train].

Returns

Ordered dictionary of objective scores.

Return type

dict

property summary(self)#

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.

transform(self, X, y=None)#

Transform the input.

Parameters

X (pd.DataFrame, or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None, calculating_residuals=False)#

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series) – Targets corresponding to the pipeline targets.
X_train (pd.DataFrame) – Training data used to generate features from past observations.
y_train (pd.Series) – Training targets used to generate features from past observations.
calculating_residuals (bool) – Whether we’re calling predict_in_sample to calculate the residuals. This means the X and y arguments are not future data, but actually the train data.

Returns

New transformed features.

Return type

pd.DataFrame

class evalml.pipelines.TimeSeriesRegularizer(time_index=None, frequency_payload=None, window_length=4, threshold=0.4, random_seed=0, **kwargs)[source]#

Transformer that regularizes an inconsistently spaced datetime column.

If X is passed in to fit/transform, the column time_index will be checked for an inferrable offset frequency. If the time_index column is perfectly inferrable then this Transformer will do nothing and return the original X and y.

If X does not have a perfectly inferrable frequency but one can be estimated, then X and y will be reformatted based on the estimated frequency for time_index. In the original X and y passed:

- Missing datetime values will be added and will have their corresponding columns in X and y set to None.
- Duplicate datetime values will be dropped.
- Extra datetime values will be dropped.
- If it can be determined that a duplicate or extra value is misaligned, then it will be repositioned to take the place of a missing value.

This Transformer should be used before the TimeSeriesImputer in order to impute the missing values that were added to X and y (if passed).

If used on multiseries dataset, works specifically on unstacked datasets.

Parameters

time_index (string) – Name of the column containing the datetime information used to order the data, required. Defaults to None.
frequency_payload (tuple) – Payload returned from Woodwork’s infer_frequency function where debug is True. Defaults to None.
window_length (int) – The size of the rolling window over which inference is conducted to determine the prevalence of uninferrable frequencies. Lower values make this component more sensitive to recognizing numerous faulty datetime values. Defaults to 4.
threshold (float) – The minimum percentage of windows that need to have been able to infer a frequency. Lower values make this component more sensitive to recognizing numerous faulty datetime values. Defaults to 0.4.
random_seed (int) – Seed for the random number generator. This transformer performs the same regardless of the random seed provided. Defaults to 0.
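
How window_length and threshold interact can be illustrated with a simplified stand-in for the frequency inference. The real check is performed by Woodwork and is more elaborate. In this sketch `inferrable_window_share` is a hypothetical helper, a window counts as inferrable only when all of its consecutive gaps are equal, and the values 4 and 0.4 are taken from the class signature.

```python
from datetime import datetime, timedelta


def inferrable_window_share(timestamps, window_length=4):
    """Fraction of rolling windows whose consecutive gaps are all equal."""
    if len(timestamps) < window_length:
        raise ValueError("need at least window_length timestamps")
    windows = [
        timestamps[i:i + window_length]
        for i in range(len(timestamps) - window_length + 1)
    ]

    def evenly_spaced(window):
        # A single distinct gap means the window has one clear frequency.
        gaps = {later - earlier for earlier, later in zip(window, window[1:])}
        return len(gaps) == 1

    return sum(evenly_spaced(w) for w in windows) / len(windows)


start = datetime(2022, 1, 1)
timestamps = [start + timedelta(days=i) for i in range(10)]
timestamps[4] += timedelta(hours=12)  # one faulty datetime value

share = inferrable_window_share(timestamps, window_length=4)
passes = share >= 0.4  # compare against the threshold
print(share, passes)
```

A single faulty timestamp spoils every window that contains it, so the share of inferrable windows falls quickly as faulty values accumulate. The threshold sets how far that share may fall before no frequency can be estimated.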

Raises

ValueError – if the frequency_payload parameter has not been passed a tuple

Attributes

hyperparameter_ranges	{}
modifies_features	True
modifies_target	True
name	Time Series Regularizer
training_only	True

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits the TimeSeriesRegularizer.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Regularizes a dataframe and target data to an inferrable offset frequency.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the TimeSeriesRegularizer.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

Raises

ValueError – if self.time_index is None, if X and y have different lengths, or if time_index in X does not have an offset frequency that can be estimated
TypeError – if the time_index column is not of type Datetime
KeyError – if the time_index column doesn’t exist

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Regularizes a dataframe and target data to an inferrable offset frequency.

A ‘clean’ X and y (if y was passed in) are created based on an inferrable offset frequency and matching datetime values with the original X and y are imputed into the clean X and y. Datetime values identified as misaligned are shifted into their appropriate position.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Data with an inferrable time_index offset frequency.

Return type

(pd.DataFrame, pd.Series)
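
The "clean" grid described above can be approximated with plain Python for intuition. This sketch is not the component's algorithm: `regularize` is a hypothetical helper, the daily step is assumed to be known rather than estimated, and repositioning of misaligned values is left out.

```python
from datetime import date, timedelta


def regularize(dates, values, step=timedelta(days=1)):
    """Rebuild (dates, values) on an evenly spaced grid.

    Missing dates receive None and repeated dates keep their first value.
    """
    observed = {}
    for d, v in zip(dates, values):
        observed.setdefault(d, v)  # drop duplicates, keeping the first

    grid = []
    current, last = min(observed), max(observed)
    while current <= last:
        grid.append(current)
        current += step
    # Values are copied where the date matches; gaps become None.
    return grid, [observed.get(d) for d in grid]


raw_dates = [date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 2), date(2022, 1, 4)]
raw_values = [10, 11, 99, 13]
clean_dates, clean_values = regularize(raw_dates, raw_values)
print(clean_dates, clean_values)
```

The None entries that appear for missing dates are the values a subsequent imputation step, such as the TimeSeriesImputer mentioned earlier, is expected to fill.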

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.Transformer(parameters=None, component_obj=None, random_seed=0, **kwargs)[source]#

A component that transforms data and may or may not need fitting. These components are used before an estimator.

To implement a new Transformer, define your own class which is a subclass of Transformer, including a name and a list of acceptable ranges for any parameters to be tuned during the automl search (hyperparameters). Define an __init__ method which sets up any necessary state and objects. Make sure your __init__ only uses standard keyword arguments and calls super().__init__() with a parameters dict. You may also override the fit, transform, fit_transform and other methods in this class if appropriate.

To see some examples, check out the definitions of any Transformer component.
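
The subclassing pattern described above can be sketched as follows. So that the snippet runs without EvalML installed, it defines a minimal stand-in `Transformer` base class, which is not the real `evalml.pipelines.Transformer`. `ClipTransformer`, its parameters, and the tuple used for the hyperparameter range are hypothetical, and plain lists of rows stand in for a pd.DataFrame.

```python
class Transformer:
    """Minimal stand-in for the EvalML base class (not the real one)."""

    name = None
    hyperparameter_ranges = {}

    def __init__(self, parameters=None, component_obj=None, random_seed=0, **kwargs):
        self.parameters = dict(parameters or {})
        self.component_obj = component_obj
        self.random_seed = random_seed

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        raise NotImplementedError

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)


class ClipTransformer(Transformer):
    """Hypothetical transformer that clips every value to [lower, upper]."""

    name = "Clip Transformer"
    hyperparameter_ranges = {"upper": (1.0, 100.0)}  # illustrative range only

    def __init__(self, lower=0.0, upper=10.0, random_seed=0, **kwargs):
        # Only standard keyword arguments, collected into a parameters dict.
        parameters = {"lower": lower, "upper": upper}
        parameters.update(kwargs)
        super().__init__(parameters=parameters, component_obj=None, random_seed=random_seed)

    def transform(self, X, y=None):
        lower, upper = self.parameters["lower"], self.parameters["upper"]
        return [[min(max(value, lower), upper) for value in row] for row in X]


clipped = ClipTransformer(upper=5.0).fit_transform([[-1.0, 3.0], [7.0, 4.0]])
print(clipped)
```

Passing everything through the parameters dictionary means a component can be rebuilt from its parameters alone, which is what methods such as clone rely on.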

Parameters

parameters (dict) – Dictionary of parameters for the component. Defaults to None.
component_obj (obj) – Third-party objects useful in component implementation. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

modifies_features	True
modifies_target	False
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`fit`	Fits component to data.
`fit_transform`	Fits on X and transforms X.
`load`	Loads component at file path.
`name`	Returns string name of this component.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`save`	Saves component at file path.
`transform`	Transforms data X.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)#

Fits component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features]
y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X, y=None)[source]#

Fits on X and transforms X.

Parameters

X (pd.DataFrame) – Data to fit and transform.
y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.
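
A minimal sketch of this contract, assuming `fit_transform` is equivalent to calling `fit` and then `transform` on the same data. `ToyCenterer` is a hypothetical transformer that uses plain lists in place of `pd.DataFrame`; it is not EvalML code.

```python
class ToyCenterer:
    """Hypothetical transformer that subtracts the training mean."""

    def __init__(self):
        self._mean = None

    def fit(self, X, y=None):
        # Learn the mean from the training data and return self for chaining.
        self._mean = sum(X) / len(X)
        return self

    def transform(self, X, y=None):
        # Apply the learned mean; fail loudly if fit was never called.
        if self._mean is None:
            raise RuntimeError("ToyCenterer must be fit before transform.")
        return [value - self._mean for value in X]

    def fit_transform(self, X, y=None):
        # Fit on X, then transform the same X.
        return self.fit(X, y).transform(X, y)


data = [1.0, 2.0, 3.0]
one_step = ToyCenterer().fit_transform(data)
two_step = ToyCenterer().fit(data).transform(data)
```

Both call styles produce the same centered values in this sketch.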

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

property name(cls)#: Returns string name of this component.

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.
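
The signature shows that `save` defaults to `cloudpickle.DEFAULT_PROTOCOL`. As a rough sketch of the save and load round trip, the example below uses the standard-library `pickle` module as a stand-in for `cloudpickle`. The `save_component` and `load_component` helpers and the `SimpleNamespace` object standing in for a component are hypothetical, not EvalML functions.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save_component(component, file_path, pickle_protocol=pickle.DEFAULT_PROTOCOL):
    # Serialize the whole object to file_path using the given protocol.
    with open(file_path, "wb") as f:
        pickle.dump(component, f, protocol=pickle_protocol)


def load_component(file_path):
    # Only unpickle files from trusted sources: unpickling can run code.
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Hypothetical component state: a name plus its parameter dictionary.
component = SimpleNamespace(name="Toy Imputer", parameters={"strategy": "mean"})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "component.pkl")
    save_component(component, path)
    restored = load_component(path)
```

The restored object is a separate copy that carries the same name and parameters as the one that was saved.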

abstract transform(self, X, y=None)[source]#

Transforms data X.

Parameters

X (pd.DataFrame) – Data to transform.
y (pd.Series, optional) – Target data.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
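
The effect of `reset_fit` can be sketched with a stand-in class. This is an illustration under stated assumptions, not EvalML's implementation: `ToyComponent` is hypothetical, and only the `_is_fitted` flag named in the parameter description is taken from the documentation above.

```python
class ToyComponent:
    """Hypothetical stand-in for a component; not part of EvalML."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)
        self._is_fitted = False

    def fit(self, X, y=None):
        # A real component would learn from X here; the sketch only sets the flag.
        self._is_fitted = True
        return self

    def update_parameters(self, update_dict, reset_fit=True):
        # Merge the new values into the existing parameter dictionary.
        self._parameters.update(update_dict)
        # A fit made with the old parameters is stale, so optionally mark
        # the component as unfitted.
        if reset_fit:
            self._is_fitted = False


component = ToyComponent(strategy="mean").fit([1, 2, 3])
component.update_parameters({"strategy": "median"})
fitted_after_default_update = component._is_fitted

component.fit([1, 2, 3])
component.update_parameters({"strategy": "mode"}, reset_fit=False)
fitted_after_keep_update = component._is_fitted
```

With the default `reset_fit=True` the component must be fit again before use; passing `reset_fit=False` keeps the fitted flag even though the parameters changed.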

class evalml.pipelines.VARMAXRegressor(time_index: Optional[Hashable] = None, p: int = 1, q: int = 0, trend: Optional[str] = 'c', random_seed: Union[int, float] = 0, maxiter: int = 10, use_covariates: bool = False, **kwargs)[source]#

Vector Autoregressive Moving Average with eXogenous regressors model. The two parameters (p, q) are the AR order and the MA order. More information here: https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.varmax.VARMAX.html.
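
For reference, the VARMAX(p, q) model described in the linked statsmodels documentation can be written as follows. The symbols are notation chosen here, not EvalML identifiers: y_t is the vector of target series at time t, x_t the exogenous regressors, A_i and M_j the AR and MA coefficient matrices, B the exogenous coefficient matrix, and c(t) the deterministic trend selected by `trend`.

```latex
y_t = c(t) + \sum_{i=1}^{p} A_i \, y_{t-i} + B \, x_t + \varepsilon_t
      + \sum_{j=1}^{q} M_j \, \varepsilon_{t-j},
\qquad \varepsilon_t \sim \mathcal{N}(0, \Omega)
```

With p = 1 and q = 0 and no exogenous term, this reduces to a first-order vector autoregression.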

Currently VARMAXRegressor isn’t supported via conda install. It’s recommended that it be installed via PyPI.

Parameters

time_index (str) – Specifies the name of the column in X that provides the datetime objects. Defaults to None.
p (int) – Maximum Autoregressive order. Defaults to 1.
q (int) – Maximum Moving Average order. Defaults to 0.
trend (str) – Controls the deterministic trend. Options are [‘n’, ‘c’, ‘t’, ‘ct’], where ‘n’ is no trend, ‘c’ is a constant term, ‘t’ is a linear trend, and ‘ct’ is both. Can also be an iterable defining a polynomial, such as [1, 1, 0, 1]. Defaults to ‘c’.
random_seed (int) – Seed for the random number generator. Defaults to 0.
maxiter (int) – Maximum number of iterations for solver. Defaults to 10.
use_covariates (bool) – If True, will pass exogenous variables in fit/predict methods. If False, forecasts will be based solely on the datetimes and target values. Defaults to False.

Attributes

hyperparameter_ranges	{ “p”: Integer(1, 10), “q”: Integer(1, 10), “trend”: Categorical([‘n’, ‘c’, ‘t’, ‘ct’]),}
model_family	ModelFamily.VARMAX
modifies_features	True
modifies_target	False
name	VARMAX Regressor
supported_problem_types	[ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Returns array of 0's with a length of 1 as feature_importance is not defined for VARMAX regressor.
`fit`	Fits VARMAX regressor to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted VARMAXRegressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using fitted VARMAX regressor.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → numpy.ndarray#: Returns array of 0’s with a length of 1 as feature_importance is not defined for VARMAX regressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.DataFrame] = None)[source]#

Fits VARMAX regressor to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.DataFrame) – The target training data of shape [n_samples, n_series_id_values].

Returns

self

Raises

ValueError – If y was not passed in.

get_prediction_intervals(self, X: pandas.DataFrame, y: pandas.DataFrame = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series][source]#

Find the prediction intervals using the fitted VARMAXRegressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.DataFrame) – Target data of shape [n_samples, n_series_id_values]. Optional.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Not used for VARMAX regressor.

Returns

A dict of prediction intervals, where the dict is in the format {series_id: {coverage}_lower or {coverage}_upper}.

Return type

dict[dict]
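
To make the nested return format concrete, the sketch below builds and reads a dictionary with that layout. The series IDs, coverage level, and bound values are invented for illustration, plain lists stand in for `pd.Series`, and the exact key strings EvalML emits may differ.

```python
coverage = [0.95]

# Hypothetical lower/upper bounds for two series over three forecast steps.
bounds = {
    "series_a": {0.95: ([9.0, 9.5, 10.0], [11.0, 11.5, 12.0])},
    "series_b": {0.95: ([4.0, 4.2, 4.4], [6.0, 6.2, 6.4])},
}

# Outer keys are series IDs; inner keys follow {coverage}_lower / {coverage}_upper.
intervals = {}
for series_id, by_coverage in bounds.items():
    intervals[series_id] = {}
    for level in coverage:
        lower, upper = by_coverage[level]
        intervals[series_id][f"{level}_lower"] = lower
        intervals[series_id][f"{level}_upper"] = upper

# Look up the upper bound of the 95% interval for one series.
series_a_upper = intervals["series_a"]["0.95_upper"]
```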

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame, y: Optional[pandas.DataFrame] = None) → pandas.Series[source]#

Make predictions using fitted VARMAX regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.DataFrame) – Target data of shape [n_samples, n_series_id_values].

Returns

Predicted values.

Return type

pd.Series

Raises

ValueError – If X was passed to fit but not passed in predict.

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.XGBoostClassifier(eta=0.1, max_depth=6, min_child_weight=1, n_estimators=100, random_seed=0, eval_metric='logloss', n_jobs=12, **kwargs)[source]#

XGBoost Classifier.

Parameters

eta (float) – Boosting learning rate. Defaults to 0.1.
max_depth (int) – Maximum tree depth for base learners. Defaults to 6.
min_child_weight (float) – Minimum sum of instance weight (hessian) needed in a child. Defaults to 1.0.
n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds. Defaults to 100.
random_seed (int) – Seed for the random number generator. Defaults to 0.
n_jobs (int) – Number of parallel threads used to run xgboost. Note that creating thread contention will significantly slow down the algorithm. Defaults to 12.

Attributes

hyperparameter_ranges	{ “eta”: Real(0.000001, 1), “max_depth”: Integer(1, 10), “min_child_weight”: Real(1, 10), “n_estimators”: Integer(1, 1000),}
model_family	ModelFamily.XGBOOST
modifies_features	True
modifies_target	False
name	XGBoost Classifier
SEED_MAX	None
SEED_MIN	None
supported_problem_types	[ ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.TIME_SERIES_BINARY, ProblemTypes.TIME_SERIES_MULTICLASS,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance of fitted XGBoost classifier.
`fit`	Fits XGBoost classifier component to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted regressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using the fitted XGBoost classifier.
`predict_proba`	Make predictions using the fitted XGBoost classifier.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self)#: Feature importance of fitted XGBoost classifier.

fit(self, X, y=None)[source]#

Fits XGBoost classifier component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].

Returns

self

get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted regressor.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (list[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict

Raises

MethodPropertyNotFoundError – If the estimator does not support Time Series Regression as a problem type.

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X)[source]#

Make predictions using the fitted XGBoost classifier.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.DataFrame

predict_proba(self, X)[source]#

Make predictions using the fitted XGBoost classifier.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.DataFrame

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.

class evalml.pipelines.XGBoostRegressor(eta: float = 0.1, max_depth: int = 6, min_child_weight: int = 1, n_estimators: int = 100, random_seed: Union[int, float] = 0, n_jobs: int = 12, **kwargs)[source]#

XGBoost Regressor.

Parameters

eta (float) – Boosting learning rate. Defaults to 0.1.
max_depth (int) – Maximum tree depth for base learners. Defaults to 6.
min_child_weight (float) – Minimum sum of instance weight (hessian) needed in a child. Defaults to 1.0.
n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds. Defaults to 100.
random_seed (int) – Seed for the random number generator. Defaults to 0.
n_jobs (int) – Number of parallel threads used to run xgboost. Note that creating thread contention will significantly slow down the algorithm. Defaults to 12.
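
To show how `eta` and `n_estimators` interact, here is a small pure-Python sketch of gradient boosting for squared error with depth-1 trees. It is a simplified stand-in, not XGBoost's algorithm: it has no second-order terms, no regularization, and no `min_child_weight`. The `fit_stump`, `boost`, and `mean_squared_error` helpers and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that minimizes squared error of residuals."""
    best = None
    # Exclude the largest value so the right side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, eta=0.1, n_estimators=100):
    """Plain gradient boosting for squared error with depth-1 trees."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # eta shrinks each tree's contribution to the running prediction.
        predictions = [p + eta * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + eta * sum(stump(x) for stump in stumps)


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

few_rounds = boost(xs, ys, eta=0.1, n_estimators=5)
many_rounds = boost(xs, ys, eta=0.1, n_estimators=100)
error_few = mean_squared_error(few_rounds, xs, ys)
error_many = mean_squared_error(many_rounds, xs, ys)
```

At a fixed `eta`, more boosting rounds lower the training error in this sketch; a smaller `eta` generally needs more rounds to reach the same fit.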

Attributes

hyperparameter_ranges	{ “eta”: Real(0.000001, 1), “max_depth”: Integer(1, 20), “min_child_weight”: Real(1, 10), “n_estimators”: Integer(1, 1000),}
model_family	ModelFamily.XGBOOST
modifies_features	True
modifies_target	False
name	XGBoost Regressor
SEED_MAX	None
SEED_MIN	None
supported_problem_types	[ ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION, ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION,]
training_only	False

Methods

`clone`	Constructs a new component with the same parameters and random state.
`default_parameters`	Returns the default parameters for this component.
`describe`	Describe a component and its parameters.
`feature_importance`	Feature importance of fitted XGBoost regressor.
`fit`	Fits XGBoost regressor component to data.
`get_prediction_intervals`	Find the prediction intervals using the fitted XGBoostRegressor.
`load`	Loads component at file path.
`needs_fitting`	Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.
`parameters`	Returns the parameters which were used to initialize the component.
`predict`	Make predictions using fitted XGBoost regressor.
`predict_proba`	Make probability estimates for labels.
`save`	Saves component at file path.
`update_parameters`	Updates the parameter dictionary of the component.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns: A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns: Default parameters for this component.
Return type: dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters

print_name (bool, optional) – whether to print name of component
return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

property feature_importance(self) → pandas.Series#: Feature importance of fitted XGBoost regressor.

fit(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None)[source]#

Fits XGBoost regressor component to data.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

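get_prediction_intervals(self, X: pandas.DataFrame, y: Optional[pandas.Series] = None, coverage: List[float] = None, predictions: pandas.Series = None) → Dict[str, pandas.Series]#

Find the prediction intervals using the fitted XGBoostRegressor.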

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
y (pd.Series) – Target data. Ignored.
coverage (List[float]) – A list of floats between the values 0 and 1 that the upper and lower bounds of the prediction interval should be calculated for.
predictions (pd.Series) – Optional list of predictions to use. If None, will generate predictions using X.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict
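
As an illustration of the return format only, the sketch below derives symmetric bounds from point predictions and an assumed residual standard deviation using a normal approximation. This is not necessarily how EvalML computes intervals for XGBoost: `normal_intervals` and `residual_std` are hypothetical, and plain lists stand in for `pd.Series`.

```python
from statistics import NormalDist


def normal_intervals(predictions, residual_std, coverage):
    """Build {coverage}_lower / {coverage}_upper entries under a normal assumption."""
    intervals = {}
    for level in coverage:
        # Two-sided interval: split the remaining probability between both tails.
        z = NormalDist().inv_cdf(0.5 + level / 2)
        margin = z * residual_std
        intervals[f"{level}_lower"] = [p - margin for p in predictions]
        intervals[f"{level}_upper"] = [p + margin for p in predictions]
    return intervals


result = normal_intervals([10.0, 12.0], residual_std=1.0, coverage=[0.8, 0.95])
```

Each requested coverage level contributes one lower and one upper entry, and a higher coverage level yields a wider interval.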

static load(file_path)#

Loads component at file path.

Parameters: file_path (str) – Location to load file.
Returns: ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns: True.

property parameters(self)#: Returns the parameters which were used to initialize the component.

predict(self, X: pandas.DataFrame) → pandas.Series[source]#

Make predictions using fitted XGBoost regressor.

Parameters: X (pd.DataFrame) – Data of shape [n_samples, n_features].
Returns: Predicted values.
Return type: pd.Series

predict_proba(self, X: pandas.DataFrame) → pandas.Series#

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: pd.Series
Raises: MethodPropertyNotFoundError – If estimator does not have a predict_proba method or a component_obj that implements predict_proba.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.

update_parameters(self, update_dict, reset_fit=True)#

Updates the parameter dictionary of the component.

Parameters

update_dict (dict) – A dict of parameters to update.
reset_fit (bool, optional) – If True, will set _is_fitted to False.
