engine#

EvalML Engine classes used to evaluate pipelines in AutoMLSearch.

Package Contents#

Classes Summary#

CFEngine

The concurrent.futures (CF) engine.

DaskEngine

The dask engine.

EngineBase

Base class for EvalML engines.

EngineComputation

Wrapper around the result of a (possibly asynchronous) engine computation.

SequentialEngine

The default engine for the AutoML search.

Functions#

evaluate_pipeline

Function submitted to the submit_evaluation_job engine method.

train_and_score_pipeline

Given a pipeline, config and data, train and score the pipeline and return the CV or TV scores.

train_pipeline

Train a pipeline and tune the threshold if necessary.

Contents#

class evalml.automl.engine.CFEngine(client=None)[source]#

The concurrent.futures (CF) engine.

Parameters

client (None or CFClient) – If None, creates a threaded pool for processing. Defaults to None.
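
The snippet below is a minimal usage sketch, not taken verbatim from the EvalML docs: it builds a CFEngine with its default threaded pool and hands it to AutoMLSearch through the engine argument (which is assumed here to accept an engine instance). The demo data loader is only there to keep the example self-contained.

```python
# Minimal sketch, assuming AutoMLSearch accepts an engine instance via `engine`.
from evalml.automl import AutoMLSearch
from evalml.automl.engine import CFEngine
from evalml.demos import load_breast_cancer

X, y = load_breast_cancer()

cf_engine = CFEngine()  # client=None -> a threaded pool is created for processing

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=cf_engine,
)
automl.search()

cf_engine.close()  # shut down the engine's client resources when finished
```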

Methods

close

Function to properly shut down the Engine's Client's resources.

is_closed

Property that determines whether the Engine's Client's resources are shut down.

setup_job_log

Set up logger for job.

submit_evaluation_job

Send evaluation job to cluster.

submit_scoring_job

Send scoring job to cluster.

submit_training_job

Send training job to cluster.

close(self)[source]#

Function to properly shut down the Engine’s Client’s resources.

property is_closed(self)#

Property that determines whether the Engine’s Client’s resources are shut down.

static setup_job_log()#

Set up logger for job.

submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Send evaluation job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to evaluate.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_holdout (pd.DataFrame) – Holdout input data for holdout scoring.

  • y_holdout (pd.Series) – Holdout target data for holdout scoring.

Returns

An object wrapping a reference to a future-like computation occurring in the resource pool.

Return type

CFComputation

submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Send scoring job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to score.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_train (pd.DataFrame) – Training features. Used for feature engineering in time series.

  • y_train (pd.Series) – Training target. Used for feature engineering in time series.

  • objectives (list[ObjectiveBase]) – Objectives to score on.

Returns

An object wrapping a reference to a future-like computation occurring in the resource pool.

Return type

CFComputation

submit_training_job(self, automl_config, pipeline, X, y)[source]#

Send training job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

An object wrapping a reference to a future-like computation occurring in the resource pool.

Return type

CFComputation

class evalml.automl.engine.DaskEngine(cluster=None)[source]#

The dask engine.

Parameters

cluster (None or dd.Client) – If None, creates a local, threaded Dask client for processing. Defaults to None.
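
A hedged sketch of wiring a DaskEngine into AutoMLSearch. The parameter above is typed as dd.Client but named cluster; the example below assumes a dask.distributed LocalCluster is acceptable, which may vary by version, so treat it as an outline rather than a recipe.

```python
# Sketch only: whether `cluster` expects a LocalCluster or a Client is an
# assumption here; check against your installed EvalML/Dask versions.
from dask.distributed import LocalCluster

from evalml.automl import AutoMLSearch
from evalml.automl.engine import DaskEngine
from evalml.demos import load_breast_cancer

X, y = load_breast_cancer()

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
dask_engine = DaskEngine(cluster=cluster)

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=dask_engine,
)
automl.search()

dask_engine.close()  # closes the underlying cluster
cluster.close()
```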

Methods

close

Closes the underlying cluster.

is_closed

Property that determines whether the Engine's Client's resources are shut down.

send_data_to_cluster

Send data to the cluster.

setup_job_log

Set up logger for job.

submit_evaluation_job

Send evaluation job to cluster.

submit_scoring_job

Send scoring job to cluster.

submit_training_job

Send training job to cluster.

close(self)[source]#

Closes the underlying cluster.

property is_closed(self)#

Property that determines whether the Engine’s Client’s resources are shut down.

send_data_to_cluster(self, X, y)[source]#

Send data to the cluster.

The implementation uses caching so the data is only sent once. This follows dask best practices.

Parameters
  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

The modeling data.

Return type

dask.Future

static setup_job_log()#

Set up logger for job.

submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Send evaluation job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to evaluate.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_holdout (pd.DataFrame) – Holdout input data for holdout scoring.

  • y_holdout (pd.Series) – Holdout target data for holdout scoring.

Returns

An object wrapping a reference to a future-like computation occurring in the dask cluster.

Return type

DaskComputation

submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Send scoring job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to score.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_train (pd.DataFrame) – Training features. Used for feature engineering in time series.

  • y_train (pd.Series) – Training target. Used for feature engineering in time series.

  • objectives (list[ObjectiveBase]) – List of objectives to score on.

Returns

An object wrapping a reference to a future-like computation occurring in the dask cluster.

Return type

DaskComputation

submit_training_job(self, automl_config, pipeline, X, y)[source]#

Send training job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

An object wrapping a reference to a future-like computation occurring in the dask cluster.

Return type

DaskComputation

class evalml.automl.engine.EngineBase[source]#

Base class for EvalML engines.

Methods

setup_job_log

Set up logger for job.

submit_evaluation_job

Submit job for pipeline evaluation during AutoMLSearch.

submit_scoring_job

Submit job for pipeline scoring.

submit_training_job

Submit job for pipeline training.

static setup_job_log()[source]#

Set up logger for job.

abstract submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Submit job for pipeline evaluation during AutoMLSearch.

abstract submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Submit job for pipeline scoring.

abstract submit_training_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Submit job for pipeline training.
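
To illustrate the interface these abstract methods define, here is a hedged skeleton of a custom engine. The class name is hypothetical and the bodies are placeholders; a real engine would run each job and return an object implementing the EngineComputation interface documented below.

```python
# Hypothetical skeleton only; the raise statements are placeholders.
from evalml.automl.engine import EngineBase


class MyCustomEngine(EngineBase):
    """Illustrative engine that only sketches the required interface."""

    def submit_evaluation_job(self, automl_config, pipeline, X, y,
                              X_holdout=None, y_holdout=None):
        # A real engine would run evaluate_pipeline (documented below) and
        # wrap the outcome in an EngineComputation-like object.
        raise NotImplementedError

    def submit_training_job(self, automl_config, pipeline, X, y,
                            X_holdout=None, y_holdout=None):
        raise NotImplementedError

    def submit_scoring_job(self, automl_config, pipeline, X, y, objectives,
                           X_train=None, y_train=None):
        raise NotImplementedError
```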

class evalml.automl.engine.EngineComputation[source]#

Wrapper around the result of a (possibly asynchronous) engine computation.

Methods

cancel

Cancel the computation.

done

Whether the computation is done.

get_result

Gets the computation result. Will block until the computation is finished.

abstract cancel(self)[source]#

Cancel the computation.

abstract done(self)[source]#

Whether the computation is done.

abstract get_result(self)[source]#

Gets the computation result. Will block until the computation is finished.

Raises

Exception – If computation fails. Returns traceback.
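
A hedged sketch of how these methods are typically used once an engine hands back a computation: check done() without blocking, then call get_result() to block for the outcome. The engine, automl_config, pipeline, X, and y names are assumed to come from an existing AutoMLSearch setup.

```python
# Sketch of the computation lifecycle; the names used here are assumed to
# exist from an AutoMLSearch setup and are not defined in this snippet.
computation = engine.submit_training_job(automl_config, pipeline, X, y)

if not computation.done():  # non-blocking progress check
    pass                    # e.g. submit other jobs or report status here

result = computation.get_result()  # blocks until finished; raises if the job failed
```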

evalml.automl.engine.evaluate_pipeline(pipeline, automl_config, X, y, logger, X_holdout=None, y_holdout=None)[source]#

Function submitted to the submit_evaluation_job engine method.

Parameters
  • pipeline (PipelineBase) – The pipeline to score.

  • automl_config (AutoMLConfig) – The AutoMLSearch object, used to access config and the error callback.

  • X (pd.DataFrame) – Training features.

  • y (pd.Series) – Training target.

  • logger – Logger object to write to.

  • X_holdout (pd.DataFrame) – Holdout set features.

  • y_holdout (pd.Series) – Holdout set target.

Returns

First: a dict containing cv_score_mean, cv_scores, training_time, and a cv_data structure with details. Second: the pipeline we trained and scored. Third: the job logger instance with all the recorded messages.

Return type

tuple of three items
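
A hedged sketch of unpacking the three-item return value described above. The logger argument is assumed to be a job logger such as the one produced by setup_job_log(); pipeline, automl_config, X, and y are assumed to already exist.

```python
# Sketch only: unpacks the documented three-item return value.
evaluation_result, trained_pipeline, job_log = evaluate_pipeline(
    pipeline, automl_config, X, y, logger,
)
print(evaluation_result["cv_score_mean"])  # key listed in the Returns section above
```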

class evalml.automl.engine.SequentialEngine[source]#

The default engine for the AutoML search.

Trains and scores pipelines locally and sequentially.
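
Because this is the default engine, passing a SequentialEngine instance explicitly should be equivalent to omitting the engine argument entirely. A minimal sketch, using a demo dataset to stay self-contained:

```python
from evalml.automl import AutoMLSearch
from evalml.automl.engine import SequentialEngine
from evalml.demos import load_breast_cancer

X, y = load_breast_cancer()

# Default behavior: pipelines are trained and scored locally and sequentially.
automl_default = AutoMLSearch(X_train=X, y_train=y, problem_type="binary")

# Equivalent, with the engine spelled out explicitly.
automl_explicit = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=SequentialEngine(),
)
```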

Methods

close

No-op.

setup_job_log

Set up logger for job.

submit_evaluation_job

Submit a job to evaluate a pipeline.

submit_scoring_job

Submit a job to score a pipeline.

submit_training_job

Submit a job to train a pipeline.

close(self)[source]#

No-op.

static setup_job_log()#

Set up logger for job.

submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Submit a job to evaluate a pipeline.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to evaluate.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_holdout (pd.DataFrame) – Holdout input data for holdout scoring.

  • y_holdout (pd.Series) – Holdout target data for holdout scoring.

Returns

Computation result.

Return type

SequentialComputation

submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Submit a job to score a pipeline.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to score.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_train (pd.DataFrame) – Training features. Used for feature engineering in time series.

  • y_train (pd.Series) – Training target. Used for feature engineering in time series.

  • objectives (list[ObjectiveBase]) – List of objectives to score on.

Returns

Computation result.

Return type

SequentialComputation

submit_training_job(self, automl_config, pipeline, X, y)[source]#

Submit a job to train a pipeline.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

Computation result.

Return type

SequentialComputation

evalml.automl.engine.train_and_score_pipeline(pipeline, automl_config, full_X_train, full_y_train, logger, X_holdout=None, y_holdout=None)[source]#

Given a pipeline, config and data, train and score the pipeline and return the CV or TV scores.

Parameters
  • pipeline (PipelineBase) – The pipeline to score.

  • automl_config (AutoMLSearch) – The AutoMLSearch object, used to access config and the error callback.

  • full_X_train (pd.DataFrame) – Training features.

  • full_y_train (pd.Series) – Training target.

  • logger – Logger object to write to.

  • X_holdout (pd.DataFrame) – Holdout set features.

  • y_holdout (pd.Series) – Holdout set target.

Raises

Exception – If there are missing target values in the training set after data split.

Returns

First: a dict containing cv_score_mean, cv_scores, training_time, and a cv_data structure with details. Second: the pipeline we trained and scored. Third: the job logger instance with all the recorded messages.

Return type

tuple of three items

evalml.automl.engine.train_pipeline(pipeline, X, y, automl_config, schema=True, get_hashes=False)[source]#

Train a pipeline and tune the threshold if necessary.

Parameters
  • pipeline (PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Features to train on.

  • y (pd.Series) – Target to train on.

  • automl_config (AutoMLSearch) – The AutoMLSearch object, used to access config and the error callback.

  • schema (bool) – Whether to use the schemas for X and y. Defaults to True.

  • get_hashes (bool) – Whether to return the hashes of the data used to train (and potentially threshold). Defaults to False.

Returns

A trained pipeline instance and, only when get_hashes is True, the hash of the input data indices.

Return type

pipeline (PipelineBase)
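
A hedged sketch of calling train_pipeline, following the Returns description above: with the default get_hashes=False only the fitted pipeline comes back, and with get_hashes=True the hash of the input data indices is assumed to be returned as a second value. The pipeline, X, y, and automl_config names are assumed to come from an existing AutoMLSearch setup.

```python
# Sketch only; the input names are assumed to exist and are not defined here.
fitted_pipeline = train_pipeline(pipeline, X, y, automl_config)

# With get_hashes=True the hash of the training data indices is also returned
# (assumed here to arrive as the second element).
fitted_pipeline, data_hash = train_pipeline(
    pipeline, X, y, automl_config, get_hashes=True,
)
```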