binary_classification_pipeline

Pipeline subclass for all binary classification pipelines.

Module Contents

Classes Summary

`BinaryClassificationPipeline`	Pipeline subclass for all binary classification pipelines.

Contents

class evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline(component_graph, parameters=None, custom_name=None, random_seed=0)

Pipeline subclass for all binary classification pipelines.

Parameters

component_graph (ComponentGraph, list, dict) – ComponentGraph instance, list of components in order, or dictionary of components. Accepts strings or ComponentBase subclasses in the list. When a list contains duplicate components, the name of each duplicate is suffixed with that component’s index in the list. For example, the component graph [Imputer, One Hot Encoder, Imputer, Logistic Regression Classifier] will have names [“Imputer”, “One Hot Encoder”, “Imputer_2”, “Logistic Regression Classifier”].
parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters. Defaults to None.
custom_name (str) – Custom name for the pipeline. Defaults to None.
random_seed (int) – Seed for the random number generator. Defaults to 0.
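
The duplicate-naming rule described for component_graph can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name unique_component_names is hypothetical, and the zero-based index suffix is inferred from the "Imputer_2" example above.

```python
def unique_component_names(components):
    """Return names in which repeated entries carry their list index as a suffix."""
    seen = set()
    names = []
    for index, name in enumerate(components):
        if name in seen:
            # A repeat: append the zero-based position in the list (assumed convention).
            names.append(f"{name}_{index}")
        else:
            seen.add(name)
            names.append(name)
    return names


graph = ["Imputer", "One Hot Encoder", "Imputer", "Logistic Regression Classifier"]
print(unique_component_names(graph))
```

The suffixed names are the keys to use in the parameters dictionary when a duplicated component needs its own settings.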

Example

>>> from evalml.pipelines import BinaryClassificationPipeline
>>> pipeline = BinaryClassificationPipeline(component_graph=["Simple Imputer", "Logistic Regression Classifier"],
...                                         parameters={"Logistic Regression Classifier": {"penalty": "elasticnet",
...                                                                                        "solver": "liblinear"}},
...                                         custom_name="My Binary Pipeline")
...
>>> assert pipeline.custom_name == "My Binary Pipeline"
>>> assert pipeline.component_graph.component_dict.keys() == {'Simple Imputer', 'Logistic Regression Classifier'}

The pipeline parameters will be chosen from the default parameters for every component, unless specific parameters were passed in as they were above.

>>> assert pipeline.parameters == {
...     'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None},
...     'Logistic Regression Classifier': {'penalty': 'elasticnet',
...                                        'C': 1.0,
...                                        'n_jobs': -1,
...                                        'multi_class': 'auto',
...                                        'solver': 'liblinear'}}

Attributes

problem_type

ProblemTypes.BINARY

Methods

`can_tune_threshold_with_objective`	Determine whether the threshold of a binary classification pipeline can be tuned.
`classes_`	Gets the class names for the pipeline. Will return None before pipeline is fit.
`clone`	Constructs a new pipeline with the same components, parameters, and random seed.
`create_objectives`	Create objective instances from a list of strings or objective classes.
`custom_name`	Custom name of the pipeline.
`describe`	Outputs pipeline details including component parameters.
`feature_importance`	Importance associated with each feature. Features dropped by the feature selection are excluded.
`fit`	Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.
`get_component`	Returns component by name.
`get_hyperparameter_ranges`	Returns hyperparameter ranges from all components as a dictionary.
`graph`	Generate an image representing the pipeline graph.
`graph_dict`	Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.
`graph_feature_importance`	Generate a bar graph of the pipeline's feature importance.
`inverse_transform`	Apply component inverse_transform methods to estimator predictions in reverse order.
`load`	Loads pipeline at file path.
`model_family`	Returns model family of this pipeline.
`name`	Name of the pipeline.
`new`	Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with Python's __new__ method.
`optimize_threshold`	Optimize the pipeline threshold given the objective to use. Only used for binary problems with objectives whose thresholds can be tuned.
`parameters`	Parameter dictionary for this pipeline.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels. Assumes that the column at index 1 represents the positive label case.
`save`	Saves pipeline at file path.
`score`	Evaluate model performance on objectives.
`summary`	A short summary of the pipeline structure, describing the list of components used.
`threshold`	Threshold used to make a prediction. Defaults to None.
`transform`	Transform the input.
`transform_all_but_final`	Transforms the data by applying all pre-processing components.

can_tune_threshold_with_objective(self, objective)

Determine whether the threshold of a binary classification pipeline can be tuned.

Parameters: objective (ObjectiveBase) – Primary AutoMLSearch objective.
Returns: True if the pipeline threshold can be tuned.
Return type: bool

property classes_(self): Gets the class names for the pipeline. Will return None before pipeline is fit.

clone(self)

Constructs a new pipeline with the same components, parameters, and random seed.

Returns: A new instance of this pipeline with identical components, parameters, and random seed.

static create_objectives(objectives): Create objective instances from a list of strings or objective classes.

property custom_name(self): Custom name of the pipeline.

describe(self, return_dict=False)

Outputs pipeline details including component parameters.

Parameters: return_dict (bool) – If True, return dictionary of information about pipeline. Defaults to False.
Returns: Dictionary of all component parameters if return_dict is True, else None.
Return type: dict

property feature_importance(self)

Importance associated with each feature. Features dropped by the feature selection are excluded.

Returns: Feature names and their corresponding importance
Return type: pd.DataFrame

fit(self, X, y)

Build a classification model. For string and categorical targets, classes are sorted by sorted(set(y)) and then are mapped to values between 0 and n_classes-1.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target training labels of length [n_samples]

Returns

self

Raises

ValueError – If the number of unique classes in y is not appropriate for the type of pipeline.
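
The label mapping that fit describes for string and categorical targets can be illustrated with the standard library alone. The sketch below is not the library's code: encode_binary_target is a hypothetical helper, and the check for exactly two classes is an assumption about what "appropriate" means for a binary pipeline.

```python
def encode_binary_target(y):
    """Map string or categorical labels to integer codes 0..n_classes-1."""
    # Classes are ordered with sorted(set(y)), as the fit description states.
    classes = sorted(set(y))
    if len(classes) != 2:
        # Assumption: a binary pipeline expects exactly two distinct classes.
        raise ValueError(f"Expected 2 unique classes, found {len(classes)}")
    code_for = {label: code for code, label in enumerate(classes)}
    return [code_for[label] for label in y], classes


encoded, classes = encode_binary_target(["spam", "ham", "spam", "ham", "ham"])
print(classes)  # ['ham', 'spam']
print(encoded)  # [1, 0, 1, 0, 0]
```

Because the ordering is alphabetical for strings, the label that sorts last receives code 1, which is the one treated as the positive case by anything that reads column index 1 of the probability output.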

get_component(self, name)

Returns component by name.

Parameters: name (str) – Name of component.
Returns: The component with the given name.
Return type: Component

get_hyperparameter_ranges(self, custom_hyperparameters)

Returns hyperparameter ranges from all components as a dictionary.

Parameters: custom_hyperparameters (dict) – Custom hyperparameters for the pipeline.
Returns: Dictionary of hyperparameter ranges for each component in the pipeline.
Return type: dict

graph(self, filepath=None)

Generate an image representing the pipeline graph.

Parameters

filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

Graph object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Digraph

Raises

RuntimeError – If graphviz is not installed.
ValueError – If path is not writeable.

graph_dict(self)

Generates a dictionary with nodes consisting of the component names and parameters, and edges detailing component relationships. This dictionary is JSON serializable in most cases.

x_edges specifies from which component feature data is being passed. y_edges specifies from which component target data is being passed. This can be used to build graphs across a variety of visualization tools.

Template: {"Nodes": {"component_name": {"Name": class_name, "Parameters": parameters_attributes}, …}, "x_edges": [[from_component_name, to_component_name], [from_component_name, to_component_name], …], "y_edges": [[from_component_name, to_component_name], [from_component_name, to_component_name], …]}

Returns: A dictionary representing the DAG structure.
Return type: dict
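
To make the template concrete, the following sketch builds a dictionary of that shape by hand and round-trips it through JSON. Every component name, class name, and parameter value here is illustrative, and the successors helper is hypothetical; a real graph_dict result may contain additional nodes or edges.

```python
import json

# Hand-written dictionary following the documented template (values are made up).
graph_as_dict = {
    "Nodes": {
        "Simple Imputer": {
            "Name": "Simple Imputer",
            "Parameters": {"impute_strategy": "most_frequent", "fill_value": None},
        },
        "Logistic Regression Classifier": {
            "Name": "Logistic Regression Classifier",
            "Parameters": {"penalty": "l2", "C": 1.0},
        },
    },
    "x_edges": [["Simple Imputer", "Logistic Regression Classifier"]],
    "y_edges": [],  # no target-passing edges in this toy graph
}


def successors(edges):
    """Turn [from, to] pairs into an adjacency mapping for a plotting tool."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    return adjacency


serialized = json.dumps(graph_as_dict)
restored = json.loads(serialized)
print(successors(restored["x_edges"]))
```

The "in most cases" caveat matters when a parameter value is not a JSON type, for example a class or a NumPy array; json.dumps would raise TypeError for such a dictionary.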

graph_feature_importance(self, importance_threshold=0)

Generate a bar graph of the pipeline’s feature importance.

Parameters: importance_threshold (float, optional) – If provided, graph only features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to 0.
Returns: A bar graph showing features and their corresponding importance.
Return type: plotly.Figure
Raises: ValueError – If importance threshold is not valid.

inverse_transform(self, y)

Apply component inverse_transform methods to estimator predictions in reverse order.

Components that implement inverse_transform are PolynomialDetrender, LogTransformer, LabelEncoder (tbd).

Parameters: y (pd.Series) – Final component features.
Returns: The inverse transform of the target.
Return type: pd.Series

static load(file_path)

Loads pipeline at file path.

Parameters: file_path (str) – Location to load file.
Returns: PipelineBase object

property model_family(self): Returns model family of this pipeline.

property name(self): Name of the pipeline.

new(self, parameters, random_seed=0)

Constructs a new instance of the pipeline with the same component graph but with a different set of parameters. Not to be confused with Python’s __new__ method.

Parameters

parameters (dict) – Dictionary with component names as keys and dictionary of that component’s parameters as values. An empty dictionary or None implies using all default values for component parameters.
random_seed (int) – Seed for the random number generator. Defaults to 0.

Returns

A new instance of this pipeline with identical components.

optimize_threshold(self, X, y, y_pred_proba, objective)

Optimize the pipeline threshold given the objective to use. Only used for binary problems with objectives whose thresholds can be tuned.

Parameters

X (pd.DataFrame) – Input features.
y (pd.Series) – Input target values.
y_pred_proba (pd.Series) – The predicted probabilities of the target output by the pipeline.
objective (ObjectiveBase) – The objective to threshold with. Must have a tunable threshold.

Raises

ValueError – If objective is not optimizable.
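
The idea behind threshold optimization can be sketched as a search over candidate cut-offs that keeps the one with the best objective value. The example below uses a plain grid search and a hand-written F1 function as stand-ins; both are assumptions made for illustration, and the library's objectives may use a different optimizer and scoring convention.

```python
def f1_score(y_true, y_pred):
    """F1 for 0/1 labels, written out so the example needs no dependencies."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, y_pred_proba, objective, steps=101):
    """Grid-search thresholds in [0, 1]; assumes a higher objective is better."""
    best = (None, float("-inf"))
    for i in range(steps):
        threshold = i / (steps - 1)
        y_pred = [int(p > threshold) for p in y_pred_proba]
        score = objective(y_true, y_pred)
        if score > best[1]:
            best = (threshold, score)
    return best


y_true = [0, 0, 1, 1, 1, 0]
y_pred_proba = [0.10, 0.40, 0.35, 0.80, 0.65, 0.20]
threshold, score = best_threshold(y_true, y_pred_proba, f1_score)
print(threshold, round(score, 3))
```

On this toy data the default cut-off of 0.5 misses the positive sample scored at 0.35, so a lower threshold gives a better F1. Objectives without a tunable threshold have nothing to search over, which is the case the ValueError covers.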

property parameters(self)

Parameter dictionary for this pipeline.

Returns: Dictionary of all component parameters.
Return type: dict

predict(self, X, objective=None, X_train=None, y_train=None)

Make predictions using selected features.

Note: predictions are first cast to integers. Computing predictions can return boolean values, which could not otherwise be transformed back to the original labels when the targets were integers.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
objective (Object or string) – The objective to use to make predictions.
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Estimated labels.

Return type

pd.Series

predict_proba(self, X, X_train=None, y_train=None)

Make probability estimates for labels. Assumes that the column at index 1 represents the positive label case.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features]
X_train (pd.DataFrame or np.ndarray or None) – Training data. Ignored. Only used for time series.
y_train (pd.Series or None) – Training labels. Ignored. Only used for time series.

Returns

Probability estimates

Return type

pd.DataFrame
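
The relationship between the probability output, the threshold, and the predicted labels can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: predict_from_proba is a hypothetical helper, probabilities are plain lists of [negative, positive] rows, and the 0.5 default is chosen for the sketch only, since the pipeline's own threshold defaults to None.

```python
def predict_from_proba(proba_rows, classes, threshold=0.5):
    """Label each row by comparing the positive-class probability to a threshold."""
    predictions = []
    for row in proba_rows:
        # Column index 1 is assumed to hold the positive-class probability.
        is_positive = row[1] > threshold
        # The boolean is cast to int before being used as a class index.
        predictions.append(classes[int(is_positive)])
    return predictions


classes = ["ham", "spam"]  # sorted class names; index 1 is the positive label
proba_rows = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]

print(predict_from_proba(proba_rows, classes))
print(predict_from_proba(proba_rows, classes, threshold=0.4))
```

Lowering the threshold moves borderline rows such as [0.5, 0.5] into the positive class, which is the trade-off that threshold tuning adjusts.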

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves pipeline at file path.

Parameters

file_path (str) – Location to save file.
pickle_protocol (int) – The pickle data stream format.
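
A save and load round trip has the following general shape. The sketch uses the standard library's pickle in place of cloudpickle so that it runs without extra dependencies, and a plain dictionary stands in for a fitted pipeline; the function names mirror the methods above but are otherwise hypothetical.

```python
import os
import pickle
import tempfile


def save(obj, file_path, pickle_protocol=pickle.DEFAULT_PROTOCOL):
    """Write obj to file_path using the given pickle protocol."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle_protocol)


def load(file_path):
    """Read back whatever object was pickled at file_path."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Stand-in for a pipeline: just the pieces of state worth checking after reload.
pipeline_stand_in = {"custom_name": "My Binary Pipeline", "threshold": 0.42}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    save(pipeline_stand_in, path)
    restored = load(path)

print(restored == pipeline_stand_in)
```

Unpickling executes code embedded in the file, so pipeline files should only be loaded from trusted sources. A file written with a newer pickle protocol may not be readable by an older Python version.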

score(self, X, y, objectives, X_train=None, y_train=None)

Evaluate model performance on objectives.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features]
y (pd.Series) – True labels of length [n_samples]
objectives (list) – List of objectives to score
X_train (pd.DataFrame) – Training data. Ignored. Only used for time series.
y_train (pd.Series) – Training labels. Ignored. Only used for time series.

Returns

Ordered dictionary of objective scores.

Return type

dict
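
The shape of the returned mapping can be illustrated with two hand-written metrics. The objective names, metric functions, and score_predictions helper below are hypothetical stand-ins; the sketch only shows how one score per objective ends up in an ordered dictionary that preserves the order of the objectives list.

```python
from collections import OrderedDict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(1 for t in y_true if t == 1)
    return true_positives / actual_positives if actual_positives else 0.0


def score_predictions(y_true, y_pred, objectives):
    """Evaluate each (name, function) pair and keep the scores in input order."""
    return OrderedDict((name, fn(y_true, y_pred)) for name, fn in objectives)


y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
scores = score_predictions(y_true, y_pred, [("Accuracy", accuracy), ("Recall", recall)])
print(scores)
```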

property summary(self)

A short summary of the pipeline structure, describing the list of components used.

Example: Logistic Regression Classifier w/ Simple Imputer + One Hot Encoder

Returns: A string describing the pipeline structure.
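
The format of the example summary above can be reproduced with a few lines of Python. This is a sketch of the string layout only: summarize is a hypothetical helper, and it assumes the last component in the list is the estimator and everything before it is preprocessing.

```python
def summarize(component_names):
    """Build '<estimator> w/ <step> + <step>' from an ordered component list."""
    # Assumption: the final entry is the estimator; the list must not be empty.
    *preprocessing, estimator = component_names
    if not preprocessing:
        return estimator
    return f"{estimator} w/ {' + '.join(preprocessing)}"


print(summarize(["Simple Imputer", "One Hot Encoder", "Logistic Regression Classifier"]))
```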

property threshold(self): Threshold used to make a prediction. Defaults to None.

transform(self, X, y=None)

Transform the input.

Parameters

X (pd.DataFrame or np.ndarray) – Data of shape [n_samples, n_features].
y (pd.Series) – The target data of length [n_samples]. Defaults to None.

Returns

Transformed output.

Return type

pd.DataFrame

transform_all_but_final(self, X, y=None, X_train=None, y_train=None)

Transforms the data by applying all pre-processing components.

Parameters

X (pd.DataFrame) – Input data to the pipeline to transform.
y (pd.Series or None) – Targets corresponding to X. Optional.
X_train (pd.DataFrame or np.ndarray or None) – Training data. Only used for time series.
y_train (pd.Series or None) – Training labels. Only used for time series.

Returns

New transformed features.

Return type

pd.DataFrame