engine#

EvalML Engine classes used to evaluate pipelines in AutoMLSearch.

Package Contents#

Classes Summary#

CFEngine

The concurrent.futures (CF) engine.

DaskEngine

The dask engine.

EngineBase

Base class for EvalML engines.

EngineComputation

Wrapper around the result of a (possibly asynchronous) engine computation.

SequentialEngine

The default engine for the AutoML search.

Functions#

evaluate_pipeline

Function submitted to the submit_evaluation_job engine method.

train_and_score_pipeline

Given a pipeline, config and data, train and score the pipeline and return the CV or TV scores.

train_pipeline

Train a pipeline and tune the threshold if necessary.

Contents#

class evalml.automl.engine.CFEngine(client=None)[source]#

The concurrent.futures (CF) engine.

Parameters

client (None or CFClient) – If None, creates a threaded pool for processing. Defaults to None.
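
The snippet below is a minimal usage sketch, not taken verbatim from the EvalML docs: it builds a CFEngine with its default threaded pool and hands it to AutoMLSearch through the engine argument (which is assumed here to accept an engine instance). The demo data loader is only there to keep the example self-contained.

```python
# Minimal sketch, assuming AutoMLSearch accepts an engine instance via `engine`.
from evalml.automl import AutoMLSearch
from evalml.automl.engine import CFEngine
from evalml.demos import load_breast_cancer

X, y = load_breast_cancer()

cf_engine = CFEngine()  # client=None -> a threaded pool is created for processing

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=cf_engine,
)
automl.search()

cf_engine.close()  # shut down the engine's client resources when finished
```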

Methods

close

Function to properly shut down the Engine's Client's resources.

is_closed

Property that determines whether the Engine's Client's resources are shut down.

setup_job_log

Set up logger for job.

submit_evaluation_job

Send evaluation job to cluster.

submit_scoring_job

Send scoring job to cluster.

submit_training_job

Send training job to cluster.

close(self)[source]#

Function to properly shut down the Engine’s Client’s resources.

property is_closed(self)#

Property that determines whether the Engine’s Client’s resources are shut down.

static setup_job_log()#

Set up logger for job.

submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Send evaluation job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to evaluate.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_holdout (pd.DataFrame) – Holdout input data for holdout scoring.

  • y_holdout (pd.Series) – Holdout target data for holdout scoring.

Returns

An object wrapping a reference to a future-like computation occurring in the resource pool.

Return type

CFComputation

submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Send scoring job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to score.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_train (pd.DataFrame) – Training features. Used for feature engineering in time series.

  • y_train (pd.Series) – Training target. Used for feature engineering in time series.

  • objectives (list[ObjectiveBase]) – Objectives to score on.

Returns

An object wrapping a reference to a future-like computation occurring in the resource pool.

Return type

CFComputation

submit_training_job(self, automl_config, pipeline, X, y)[source]#

Send training job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

An object wrapping a reference to a future-like computation occurring in the resource pool.

Return type

CFComputation

class evalml.automl.engine.DaskEngine(cluster=None)[source]#

The dask engine.

Parameters

cluster (None or dd.Client) – If None, creates a local, threaded Dask client for processing. Defaults to None.
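
A hedged sketch of wiring a DaskEngine into AutoMLSearch. The parameter above is typed as dd.Client but named cluster; the example below assumes a dask.distributed LocalCluster is acceptable, which may vary by version, so treat it as an outline rather than a recipe.

```python
# Sketch only: whether `cluster` expects a LocalCluster or a Client is an
# assumption here; check against your installed EvalML/Dask versions.
from dask.distributed import LocalCluster

from evalml.automl import AutoMLSearch
from evalml.automl.engine import DaskEngine
from evalml.demos import load_breast_cancer

X, y = load_breast_cancer()

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
dask_engine = DaskEngine(cluster=cluster)

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=dask_engine,
)
automl.search()

dask_engine.close()  # closes the underlying cluster
cluster.close()
```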

Methods

close

Closes the underlying cluster.

is_closed

Property that determines whether the Engine's Client's resources are shut down.

send_data_to_cluster

Send data to the cluster.

setup_job_log

Set up logger for job.

submit_evaluation_job

Send evaluation job to cluster.

submit_scoring_job

Send scoring job to cluster.

submit_training_job

Send training job to cluster.

close(self)[source]#

Closes the underlying cluster.

property is_closed(self)#

Property that determines whether the Engine’s Client’s resources are shut down.

send_data_to_cluster(self, X, y)[source]#

Send data to the cluster.

The implementation uses caching so the data is only sent once. This follows dask best practices.

Parameters
  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

The modeling data.

Return type

dask.Future

static setup_job_log()#

Set up logger for job.

submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Send evaluation job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to evaluate.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_holdout (pd.DataFrame) – Holdout input data for holdout scoring.

  • y_holdout (pd.Series) – Holdout target data for holdout scoring.

Returns

An object wrapping a reference to a future-like computation occurring in the dask cluster.

Return type

DaskComputation

submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Send scoring job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to score.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_train (pd.DataFrame) – Training features. Used for feature engineering in time series.

  • y_train (pd.Series) – Training target. Used for feature engineering in time series.

  • objectives (list[ObjectiveBase]) – List of objectives to score on.

Returns

An object wrapping a reference to a future-like computation occurring in the dask cluster.

Return type

DaskComputation

submit_training_job(self, automl_config, pipeline, X, y)[source]#

Send training job to cluster.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

An object wrapping a reference to a future-like computation occurring in the dask cluster.

Return type

DaskComputation

class evalml.automl.engine.EngineBase[source]#

Base class for EvalML engines.

Methods

setup_job_log

Set up logger for job.

submit_evaluation_job

Submit job for pipeline evaluation during AutoMLSearch.

submit_scoring_job

Submit job for pipeline scoring.

submit_training_job

Submit job for pipeline training.

static setup_job_log()[source]#

Set up logger for job.

abstract submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Submit job for pipeline evaluation during AutoMLSearch.

abstract submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Submit job for pipeline scoring.

abstract submit_training_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Submit job for pipeline training.
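
To illustrate the interface these abstract methods define, here is a hedged skeleton of a custom engine. The class name is hypothetical and the bodies are placeholders; a real engine would run each job and return an object implementing the EngineComputation interface documented below.

```python
# Hypothetical skeleton only; the raise statements are placeholders.
from evalml.automl.engine import EngineBase


class MyCustomEngine(EngineBase):
    """Illustrative engine that only sketches the required interface."""

    def submit_evaluation_job(self, automl_config, pipeline, X, y,
                              X_holdout=None, y_holdout=None):
        # A real engine would run evaluate_pipeline (documented below) and
        # wrap the outcome in an EngineComputation-like object.
        raise NotImplementedError

    def submit_training_job(self, automl_config, pipeline, X, y,
                            X_holdout=None, y_holdout=None):
        raise NotImplementedError

    def submit_scoring_job(self, automl_config, pipeline, X, y, objectives,
                           X_train=None, y_train=None):
        raise NotImplementedError
```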

class evalml.automl.engine.EngineComputation[source]#

Wrapper around the result of a (possibly asynchronous) engine computation.

Methods

cancel

Cancel the computation.

done

Whether the computation is done.

get_result

Gets the computation result. Will block until the computation is finished.

abstract cancel(self)[source]#

Cancel the computation.

abstract done(self)[source]#

Whether the computation is done.

abstract get_result(self)[source]#

Gets the computation result. Will block until the computation is finished.

Raises

Exception – If computation fails. Returns traceback.
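
A hedged sketch of how these methods are typically used once an engine hands back a computation: check done() without blocking, then call get_result() to block for the outcome. The engine, automl_config, pipeline, X, and y names are assumed to come from an existing AutoMLSearch setup.

```python
# Sketch of the computation lifecycle; the names used here are assumed to
# exist from an AutoMLSearch setup and are not defined in this snippet.
computation = engine.submit_training_job(automl_config, pipeline, X, y)

if not computation.done():  # non-blocking progress check
    pass                    # e.g. submit other jobs or report status here

result = computation.get_result()  # blocks until finished; raises if the job failed
```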

evalml.automl.engine.evaluate_pipeline(pipeline, automl_config, X, y, logger, X_holdout=None, y_holdout=None)[source]#

Function submitted to the submit_evaluation_job engine method.

Parameters
  • pipeline (PipelineBase) – The pipeline to score.

  • automl_config (AutoMLConfig) – The AutoMLSearch object, used to access config and the error callback.

  • X (pd.DataFrame) – Training features.

  • y (pd.Series) – Training target.

  • logger – Logger object to write to.

  • X_holdout (pd.DataFrame) – Holdout set features.

  • y_holdout (pd.Series) – Holdout set target.

Returns

First: a dict containing cv_score_mean, cv_scores, training_time, and a cv_data structure with details. Second: the pipeline we trained and scored. Third: the job logger instance with all the recorded messages.

Return type

tuple of three items
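
A hedged sketch of unpacking the three-item return value described above. The logger argument is assumed to be a job logger such as the one produced by setup_job_log(); pipeline, automl_config, X, and y are assumed to already exist.

```python
# Sketch only: unpacks the documented three-item return value.
evaluation_result, trained_pipeline, job_log = evaluate_pipeline(
    pipeline, automl_config, X, y, logger,
)
print(evaluation_result["cv_score_mean"])  # key listed in the Returns section above
```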

class evalml.automl.engine.SequentialEngine[source]#

The default engine for the AutoML search.

Trains and scores pipelines locally and sequentially.
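
Because this is the default engine, passing a SequentialEngine instance explicitly should be equivalent to omitting the engine argument entirely. A minimal sketch, using a demo dataset to stay self-contained:

```python
from evalml.automl import AutoMLSearch
from evalml.automl.engine import SequentialEngine
from evalml.demos import load_breast_cancer

X, y = load_breast_cancer()

# Default behavior: pipelines are trained and scored locally and sequentially.
automl_default = AutoMLSearch(X_train=X, y_train=y, problem_type="binary")

# Equivalent, with the engine spelled out explicitly.
automl_explicit = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=SequentialEngine(),
)
```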

Methods

close

No-op.

setup_job_log

Set up logger for job.

submit_evaluation_job

Submit a job to evaluate a pipeline.

submit_scoring_job

Submit a job to score a pipeline.

submit_training_job

Submit a job to train a pipeline.

close(self)[source]#

No-op.

static setup_job_log()#

Set up logger for job.

submit_evaluation_job(self, automl_config, pipeline, X, y, X_holdout=None, y_holdout=None)[source]#

Submit a job to evaluate a pipeline.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to evaluate.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_holdout (pd.DataFrame) – Holdout input data for holdout scoring.

  • y_holdout (pd.Series) – Holdout target data for holdout scoring.

Returns

Computation result.

Return type

SequentialComputation

submit_scoring_job(self, automl_config, pipeline, X, y, objectives, X_train=None, y_train=None)[source]#

Submit a job to score a pipeline.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to score.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

  • X_train (pd.DataFrame) – Training features. Used for feature engineering in time series.

  • y_train (pd.Series) – Training target. Used for feature engineering in time series.

  • objectives (list[ObjectiveBase]) – List of objectives to score on.

Returns

Computation result.

Return type

SequentialComputation

submit_training_job(self, automl_config, pipeline, X, y)[source]#

Submit a job to train a pipeline.

Parameters
  • automl_config – Structure containing data passed from AutoMLSearch instance.

  • pipeline (pipeline.PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Input data for modeling.

  • y (pd.Series) – Target data for modeling.

Returns

Computation result.

Return type

SequentialComputation

evalml.automl.engine.train_and_score_pipeline(pipeline, automl_config, full_X_train, full_y_train, logger, X_holdout=None, y_holdout=None)[source]#

Given a pipeline, config and data, train and score the pipeline and return the CV or TV scores.

Parameters
  • pipeline (PipelineBase) – The pipeline to score.

  • automl_config (AutoMLSearch) – The AutoMLSearch object, used to access config and the error callback.

  • full_X_train (pd.DataFrame) – Training features.

  • full_y_train (pd.Series) – Training target.

  • logger – Logger object to write to.

  • X_holdout (pd.DataFrame) – Holdout set features.

  • y_holdout (pd.Series) – Holdout set target.

Raises

Exception – If there are missing target values in the training set after data split.

Returns

First: a dict containing cv_score_mean, cv_scores, training_time, and a cv_data structure with details. Second: the pipeline we trained and scored. Third: the job logger instance with all the recorded messages.

Return type

tuple of three items

evalml.automl.engine.train_pipeline(pipeline, X, y, automl_config, schema=True, get_hashes=False)[source]#

Train a pipeline and tune the threshold if necessary.

Parameters
  • pipeline (PipelineBase) – Pipeline to train.

  • X (pd.DataFrame) – Features to train on.

  • y (pd.Series) – Target to train on.

  • automl_config (AutoMLSearch) – The AutoMLSearch object, used to access config and the error callback.

  • schema (bool) – Whether to use the schemas for X and y. Defaults to True.

  • get_hashes (bool) – Whether to return the hashes of the data used to train (and potentially threshold). Defaults to False.

Returns

A trained pipeline instance and, only when get_hashes is True, the hash of the input data indices.

Return type

pipeline (PipelineBase)
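
A hedged sketch of calling train_pipeline, following the Returns description above: with the default get_hashes=False only the fitted pipeline comes back, and with get_hashes=True the hash of the input data indices is assumed to be returned as a second value. The pipeline, X, y, and automl_config names are assumed to come from an existing AutoMLSearch setup.

```python
# Sketch only; the input names are assumed to exist and are not defined here.
fitted_pipeline = train_pipeline(pipeline, X, y, automl_config)

# With get_hashes=True the hash of the training data indices is also returned
# (assumed here to arrive as the second element).
fitted_pipeline, data_hash = train_pipeline(
    pipeline, X, y, automl_config, get_hashes=True,
)
```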