graphs
Model understanding graphing utilities.
Module Contents
Functions
binary_objective_vs_threshold | Computes objective score as a function of potential binary classification decision thresholds for a fitted binary classification pipeline.
confusion_matrix | Confusion matrix for binary and multiclass classification.
decision_tree_data_from_estimator | Return data for a fitted tree in a restructured format.
decision_tree_data_from_pipeline | Return data for a fitted pipeline in a restructured format.
get_linear_coefficients | Returns a dataframe showing the features with the greatest predictive power for a linear model.
get_prediction_vs_actual_data | Combines y_true and y_pred into a single dataframe and adds a column for outliers. Used in graph_prediction_vs_actual().
get_prediction_vs_actual_over_time_data | Get the data needed for the prediction_vs_actual_over_time plot.
graph_binary_objective_vs_threshold | Generates a plot graphing objective score vs. decision thresholds for a fitted binary classification pipeline.
graph_confusion_matrix | Generate and display a confusion matrix plot.
graph_permutation_importance | Generate a bar graph of the pipeline's permutation importance.
graph_precision_recall_curve | Generate and display a precision-recall plot.
graph_prediction_vs_actual | Generate a scatter plot comparing the true and predicted values. Used for regression plotting.
graph_prediction_vs_actual_over_time | Plot the target values and predictions against time on the x-axis.
graph_roc_curve | Generate and display a Receiver Operating Characteristic (ROC) plot for binary and multiclass classification problems.
graph_t_sne | Plot high dimensional data into lower dimensional space using t-SNE.
normalize_confusion_matrix | Normalizes a confusion matrix.
precision_recall_curve | Given labels and binary classifier predicted probabilities, compute and return the data representing a precision-recall curve.
roc_curve | Given labels and classifier predicted probabilities, compute and return the data representing a Receiver Operating Characteristic (ROC) curve. Works with binary or multiclass problems.
t_sne | Get the transformed output after fitting X to the embedded space using t-SNE.
visualize_decision_tree | Generate an image visualizing the decision tree.
Contents
- evalml.model_understanding.graphs.binary_objective_vs_threshold(pipeline, X, y, objective, steps=100)
Computes objective score as a function of potential binary classification decision thresholds for a fitted binary classification pipeline.
- Parameters
pipeline (BinaryClassificationPipeline obj) – Fitted binary classification pipeline.
X (pd.DataFrame) – The input data used to compute objective score.
y (pd.Series) – The target labels.
objective (ObjectiveBase obj, str) – Objective used to score.
steps (int) – Number of intervals to divide and calculate objective score at.
- Returns
DataFrame with thresholds and the corresponding objective score calculated at each threshold.
- Return type
pd.DataFrame
- Raises
ValueError – If objective is not a binary classification objective.
ValueError – If objective’s score_needs_proba is not False.
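The threshold sweep described above can be sketched in plain Python. This is only an illustration of the idea, not evalml's implementation: the helper names `accuracy` and `objective_vs_threshold`, the dictionary keys, the evenly spaced thresholds on [0, 1], and the "strictly above the threshold means positive" rule are all assumptions made for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (a stand-in objective)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def objective_vs_threshold(y_true, y_proba, objective, steps=100):
    """Score `objective` at steps + 1 evenly spaced thresholds in [0, 1]."""
    rows = []
    for i in range(steps + 1):
        threshold = i / steps
        # Assumption: a probability strictly above the threshold is labeled positive.
        y_pred = [1 if p > threshold else 0 for p in y_proba]
        rows.append({"threshold": threshold, "score": objective(y_true, y_pred)})
    return rows


y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

rows = objective_vs_threshold(y_true, y_proba, accuracy, steps=4)
best = max(rows, key=lambda row: row["score"])  # first threshold with the top score
print(best)
```

The objective here consumes thresholded labels rather than probabilities, which is consistent with the requirement that score_needs_proba be False.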
- evalml.model_understanding.graphs.confusion_matrix(y_true, y_predicted, normalize_method='true')
Confusion matrix for binary and multiclass classification.
- Parameters
y_true (pd.Series or np.ndarray) – True labels.
y_predicted (pd.Series or np.ndarray) – Predictions from a classifier.
normalize_method ({'true', 'pred', 'all', None}) – Normalization method to use, if not None. Supported options are: ‘true’ to normalize by row, ‘pred’ to normalize by column, or ‘all’ to normalize by all values. Defaults to ‘true’.
- Returns
Confusion matrix. The column header represents the predicted labels while row header represents the actual labels.
- Return type
pd.DataFrame
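To make the row/column convention concrete, here is a small standard-library sketch that counts an unnormalized confusion matrix with actual labels on the rows and predicted labels on the columns. The helper `confusion_counts` is hypothetical and uses nested lists instead of a pd.DataFrame; it is not the library's code.

```python
def confusion_counts(y_true, y_predicted):
    """Count (actual, predicted) pairs; rows are actual labels, columns are predicted."""
    labels = sorted(set(y_true) | set(y_predicted))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_predicted):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix


y_true = ["cat", "cat", "dog", "dog", "dog"]
y_predicted = ["cat", "dog", "dog", "dog", "cat"]

labels, matrix = confusion_counts(y_true, y_predicted)
for label, row in zip(labels, matrix):
    print(label, row)
```

Reading across the "dog" row gives what happened to the samples that were actually dogs; reading down the "dog" column gives what was predicted as a dog.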
- evalml.model_understanding.graphs.decision_tree_data_from_estimator(estimator)
Return data for a fitted tree in a restructured format.
- Parameters
estimator (ComponentBase) – A fitted DecisionTree-based estimator.
- Returns
An OrderedDict of OrderedDicts describing a tree structure.
- Return type
OrderedDict
- Raises
ValueError – If estimator is not a decision tree-based estimator.
NotFittedError – If estimator is not yet fitted.
- evalml.model_understanding.graphs.decision_tree_data_from_pipeline(pipeline_)
Return data for a fitted pipeline in a restructured format.
- Parameters
pipeline_ (PipelineBase) – A pipeline with a DecisionTree-based estimator.
- Returns
An OrderedDict of OrderedDicts describing a tree structure.
- Return type
OrderedDict
- Raises
ValueError – If estimator is not a decision tree-based estimator.
NotFittedError – If estimator is not yet fitted.
- evalml.model_understanding.graphs.get_linear_coefficients(estimator, features=None)
Returns a dataframe showing the features with the greatest predictive power for a linear model.
- Parameters
estimator (Estimator) – Fitted linear model family estimator.
features (list[str]) – List of feature names associated with the underlying data.
- Returns
DataFrame listing the features by importance.
- Return type
pd.DataFrame
- Raises
ValueError – If the model is not a linear model.
NotFittedError – If the model is not yet fitted.
- evalml.model_understanding.graphs.get_prediction_vs_actual_data(y_true, y_pred, outlier_threshold=None)
Combines y_true and y_pred into a single dataframe and adds a column for outliers. Used in graph_prediction_vs_actual().
- Parameters
y_true (pd.Series or np.ndarray) – The real target values of the data.
y_pred (pd.Series or np.ndarray) – The predicted values outputted by the regression model.
outlier_threshold (int, float) – A positive threshold for what is considered an outlier value. This value is compared to the absolute difference between each value of y_true and y_pred. Values within this threshold will be blue, otherwise they will be yellow. Defaults to None.
- Returns
DataFrame with the following columns:
prediction: Predicted values from regression model.
actual: Real target values.
outlier: Colors indicating which values are in the threshold for what is considered an outlier value.
- Return type
pd.DataFrame
- Raises
ValueError – If threshold is not positive.
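The outlier column can be illustrated with a short sketch. It assumes plain color names rather than whatever color encoding the library emits, treats a missing threshold as "nothing is an outlier", and leaves the exact handling of a difference equal to the threshold unspecified; `prediction_vs_actual_rows` is a hypothetical helper, not the evalml function.

```python
def prediction_vs_actual_rows(y_true, y_pred, outlier_threshold=None):
    """Pair each prediction with its actual value and color-code outliers."""
    if outlier_threshold is not None and outlier_threshold <= 0:
        raise ValueError("outlier_threshold must be positive")
    rows = []
    for actual, prediction in zip(y_true, y_pred):
        # Assumption: with no threshold, every point counts as "within".
        within = (
            outlier_threshold is None
            or abs(actual - prediction) < outlier_threshold
        )
        rows.append(
            {
                "prediction": prediction,
                "actual": actual,
                "outlier": "blue" if within else "yellow",
            }
        )
    return rows


rows = prediction_vs_actual_rows([10, 20, 30], [11, 27, 30], outlier_threshold=5)
colors = [row["outlier"] for row in rows]
print(colors)
```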
- evalml.model_understanding.graphs.get_prediction_vs_actual_over_time_data(pipeline, X, y, X_train, y_train, dates)
Get the data needed for the prediction_vs_actual_over_time plot.
- Parameters
pipeline (TimeSeriesRegressionPipeline) – Fitted time series regression pipeline.
X (pd.DataFrame) – Features used to generate new predictions.
y (pd.Series) – Target values to compare predictions against.
X_train (pd.DataFrame) – Data the pipeline was trained on.
y_train (pd.Series) – Target values for training data.
dates (pd.Series) – Dates corresponding to target values and predictions.
- Returns
Predictions vs. time.
- Return type
pd.DataFrame
- evalml.model_understanding.graphs.graph_binary_objective_vs_threshold(pipeline, X, y, objective, steps=100)
Generates a plot graphing objective score vs. decision thresholds for a fitted binary classification pipeline.
- Parameters
pipeline (PipelineBase or subclass) – Fitted pipeline.
X (pd.DataFrame) – The input data used to compute objective score.
y (pd.Series) – The target labels.
objective (ObjectiveBase obj, str) – Objective used to score, shown on the y-axis of the graph.
steps (int) – Number of intervals to divide and calculate objective score at.
- Returns
plotly.Figure representing the objective score vs. threshold graph generated.
- evalml.model_understanding.graphs.graph_confusion_matrix(y_true, y_pred, normalize_method='true', title_addition=None)
Generate and display a confusion matrix plot.
If normalize_method is set, the hover text shows the raw counts; otherwise, it shows the counts normalized with method ‘true’.
- Parameters
y_true (pd.Series or np.ndarray) – True binary labels.
y_pred (pd.Series or np.ndarray) – Predictions from a binary classifier.
normalize_method ({'true', 'pred', 'all', None}) – Normalization method to use, if not None. Supported options are: ‘true’ to normalize by row, ‘pred’ to normalize by column, or ‘all’ to normalize by all values. Defaults to ‘true’.
title_addition (str) – If not None, append to plot title. Defaults to None.
- Returns
plotly.Figure representing the confusion matrix plot generated.
- evalml.model_understanding.graphs.graph_permutation_importance(pipeline, X, y, objective, importance_threshold=0)
Generate a bar graph of the pipeline’s permutation importance.
- Parameters
pipeline (PipelineBase or subclass) – Fitted pipeline.
X (pd.DataFrame) – The input data used to score and compute permutation importance.
y (pd.Series) – The target data.
objective (str, ObjectiveBase) – Objective to score on.
importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to 0.
- Returns
plotly.Figure, a bar graph showing features and their respective permutation importance.
- Raises
ValueError – If importance_threshold is not greater than or equal to 0.
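The effect of importance_threshold can be sketched as a filter on absolute importance. The helper `filter_importances` and the feature names are made up for illustration, and the strict "larger than" comparison follows the wording above; how the library treats a value exactly equal to the threshold is an assumption here.

```python
def filter_importances(importances, importance_threshold=0):
    """Keep features whose absolute importance exceeds the threshold."""
    if importance_threshold < 0:
        raise ValueError("importance_threshold must be greater than or equal to 0")
    # Assumption: strict comparison, following the "larger than" wording.
    return {
        name: value
        for name, value in importances.items()
        if abs(value) > importance_threshold
    }


# Hypothetical permutation importances; negative values mean shuffling helped.
importances = {"age": 0.12, "zip_code": -0.03, "income": 0.3, "noise": 0.0}

kept = filter_importances(importances, importance_threshold=0.05)
print(kept)
```

Filtering on the absolute value keeps strongly negative importances visible, since those can also be informative when diagnosing a pipeline.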
- evalml.model_understanding.graphs.graph_precision_recall_curve(y_true, y_pred_proba, title_addition=None)
Generate and display a precision-recall plot.
- Parameters
y_true (pd.Series or np.ndarray) – True binary labels.
y_pred_proba (pd.Series or np.ndarray) – Predictions from a binary classifier, before thresholding has been applied. Note this should be the predicted probability for the “true” label.
title_addition (str or None) – If not None, append to plot title. Defaults to None.
- Returns
plotly.Figure representing the precision-recall plot generated
- evalml.model_understanding.graphs.graph_prediction_vs_actual(y_true, y_pred, outlier_threshold=None)
Generate a scatter plot comparing the true and predicted values. Used for regression plotting.
- Parameters
y_true (pd.Series) – The real target values of the data.
y_pred (pd.Series) – The predicted values outputted by the regression model.
outlier_threshold (int, float) – A positive threshold for what is considered an outlier value. This value is compared to the absolute difference between each value of y_true and y_pred. Values within this threshold will be blue, otherwise they will be yellow. Defaults to None.
- Returns
plotly.Figure representing the predicted vs. actual values graph
- Raises
ValueError – If threshold is not positive.
- evalml.model_understanding.graphs.graph_prediction_vs_actual_over_time(pipeline, X, y, X_train, y_train, dates)
Plot the target values and predictions against time on the x-axis.
- Parameters
pipeline (TimeSeriesRegressionPipeline) – Fitted time series regression pipeline.
X (pd.DataFrame) – Features used to generate new predictions.
y (pd.Series) – Target values to compare predictions against.
X_train (pd.DataFrame) – Data the pipeline was trained on.
y_train (pd.Series) – Target values for training data.
dates (pd.Series) – Dates corresponding to target values and predictions.
- Returns
Figure showing the predictions vs. actual values over time.
- Return type
plotly.Figure
- Raises
ValueError – If the pipeline is not a time-series regression pipeline.
- evalml.model_understanding.graphs.graph_roc_curve(y_true, y_pred_proba, custom_class_names=None, title_addition=None)
Generate and display a Receiver Operating Characteristic (ROC) plot for binary and multiclass classification problems.
- Parameters
y_true (pd.Series or np.ndarray) – True labels.
y_pred_proba (pd.Series or np.ndarray) – Predictions from a classifier, before thresholding has been applied. Note this should be a one-dimensional array with the predicted probability for the “true” label in the binary case.
custom_class_names (list or None) – If not None, custom labels for classes. Defaults to None.
title_addition (str or None) – If not None, append to plot title. Defaults to None.
- Returns
plotly.Figure representing the ROC plot generated
- Raises
ValueError – If the number of custom class names does not match number of classes in the input data.
- evalml.model_understanding.graphs.graph_t_sne(X, n_components=2, perplexity=30.0, learning_rate=200.0, metric='euclidean', marker_line_width=2, marker_size=7, **kwargs)
Plot high dimensional data into lower dimensional space using t-SNE.
- Parameters
X (np.ndarray, pd.DataFrame) – Data to be transformed. Must be numeric.
n_components (int) – Dimension of the embedded space. Defaults to 2.
perplexity (float) – Related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Defaults to 30.
learning_rate (float) – Usually in the range [10.0, 1000.0]. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help. Must be positive. Defaults to 200.
metric (str) – The metric to use when calculating distance between instances in a feature array. The default is “euclidean” which is interpreted as the squared euclidean distance.
marker_line_width (int) – Determines the line width of the marker boundary. Defaults to 2.
marker_size (int) – Determines the size of the marker. Defaults to 7.
kwargs – Arbitrary keyword arguments.
- Returns
Figure representing the transformed data.
- Return type
plotly.Figure
- Raises
ValueError – If marker_line_width or marker_size are not valid values.
- evalml.model_understanding.graphs.normalize_confusion_matrix(conf_mat, normalize_method='true')
Normalizes a confusion matrix.
- Parameters
conf_mat (pd.DataFrame or np.ndarray) – Confusion matrix to normalize.
normalize_method ({'true', 'pred', 'all'}) – Normalization method. Supported options are: ‘true’ to normalize by row, ‘pred’ to normalize by column, or ‘all’ to normalize by all values. Defaults to ‘true’.
- Returns
Normalized version of the input confusion matrix. The column header represents the predicted labels while row header represents the actual labels.
- Return type
pd.DataFrame
- Raises
ValueError – If configuration is invalid, or if the sum of a given axis is zero and normalization by axis is specified.
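The three normalization methods can be compared side by side with a minimal sketch on nested lists. The helper `normalize` is hypothetical and is not the library's implementation; rejecting an all-zero matrix under 'all' is an extra assumption made so the example never divides by zero.

```python
def normalize(matrix, normalize_method="true"):
    """Normalize a confusion matrix given as a list of rows (actual x predicted)."""
    if normalize_method == "true":
        # Each row (actual label) sums to 1.
        totals = [sum(row) for row in matrix]
        if 0 in totals:
            raise ValueError("a row sums to zero")
        return [[v / t for v in row] for row, t in zip(matrix, totals)]
    if normalize_method == "pred":
        # Each column (predicted label) sums to 1.
        totals = [sum(col) for col in zip(*matrix)]
        if 0 in totals:
            raise ValueError("a column sums to zero")
        return [[v / t for v, t in zip(row, totals)] for row in matrix]
    if normalize_method == "all":
        # The whole matrix sums to 1.
        total = sum(sum(row) for row in matrix)
        if total == 0:  # assumption: reject an all-zero matrix as well
            raise ValueError("the matrix sums to zero")
        return [[v / total for v in row] for row in matrix]
    raise ValueError(f"invalid normalize_method: {normalize_method!r}")


counts = [[3, 1], [2, 4]]
by_row = normalize(counts, "true")
by_col = normalize(counts, "pred")
by_all = normalize(counts, "all")
print(by_row, by_col, by_all, sep="\n")
```

With 'true' the diagonal reads as per-class recall, with 'pred' it reads as per-class precision, and with 'all' every cell is a share of the total sample count.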
- evalml.model_understanding.graphs.precision_recall_curve(y_true, y_pred_proba, pos_label_idx=-1)
Given labels and binary classifier predicted probabilities, compute and return the data representing a precision-recall curve.
- Parameters
y_true (pd.Series or np.ndarray) – True binary labels.
y_pred_proba (pd.Series or np.ndarray) – Predictions from a binary classifier, before thresholding has been applied. Note this should be the predicted probability for the “true” label.
pos_label_idx (int) – The column index corresponding to the positive class. If predicted probabilities are two-dimensional, this will be used to access the probabilities for the positive class.
- Returns
Dictionary containing metrics used to generate a precision-recall plot, with the following keys:
precision: Precision values.
recall: Recall values.
thresholds: Threshold values used to produce the precision and recall.
auc_score: The area under the precision-recall curve.
- Return type
dict
- Raises
NoPositiveLabelException – If predicted probabilities do not contain a column at the specified label.
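The role of pos_label_idx and the meaning of a single point on the curve can be sketched as follows. Both helpers are hypothetical, operate on plain lists, and assume that a probability at or above the threshold counts as a positive prediction; the conventions for empty denominators are also choices made for this example.

```python
def positive_probabilities(y_pred_proba, pos_label_idx=-1):
    """Return one positive-class probability per sample.

    Two-dimensional input (one row of class probabilities per sample) is
    reduced to the column at pos_label_idx; one-dimensional input is kept.
    """
    if y_pred_proba and isinstance(y_pred_proba[0], (list, tuple)):
        return [row[pos_label_idx] for row in y_pred_proba]
    return list(y_pred_proba)


def precision_recall_at(y_true, proba, threshold):
    """Precision and recall when predicting positive for proba >= threshold."""
    y_pred = [p >= threshold for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0  # no positive labels
    return precision, recall


# Columns are [P(class 0), P(class 1)]; the default index -1 picks the last one.
two_column = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
y_true = [0, 1, 1, 1]

proba = positive_probabilities(two_column)
precision, recall = precision_recall_at(y_true, proba, threshold=0.5)
print(proba, precision, recall)
```

Repeating the second step over every distinct probability value yields the full set of precision, recall, and threshold values that a precision-recall plot is drawn from.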
- evalml.model_understanding.graphs.roc_curve(y_true, y_pred_proba)
Given labels and classifier predicted probabilities, compute and return the data representing a Receiver Operating Characteristic (ROC) curve. Works with binary or multiclass problems.
- Parameters
y_true (pd.Series or np.ndarray) – True labels.
y_pred_proba (pd.Series or np.ndarray) – Predictions from a classifier, before thresholding has been applied.
- Returns
A list of dictionaries, one for each class. Binary classification problems return a list with one dictionary.
Each dictionary contains metrics used to generate an ROC plot, with the following keys:
fpr_rate: False positive rate.
tpr_rate: True positive rate.
threshold: Threshold values used to produce each pair of true/false positive rates.
auc_score: The area under the ROC curve.
- Return type
list(dict)
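For the binary case, the quantities in each dictionary can be sketched by hand. The helpers `roc_points` and `trapezoid_auc` are hypothetical, assume 0/1 labels with at least one of each, and return plain lists rather than the dictionary described above; they illustrate the computation and are not the library's code.

```python
def roc_points(y_true, y_pred_proba):
    """False/true positive rates at every distinct probability threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sweep thresholds from high to low so both rates never decrease.
    thresholds = sorted(set(y_pred_proba), reverse=True)
    fpr, tpr = [0.0], [0.0]  # the curve starts at the origin
    for threshold in thresholds:
        tp = sum(1 for t, p in zip(y_true, y_pred_proba) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred_proba) if p >= threshold and t == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr, thresholds


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


y_true = [0, 0, 1, 1]
y_pred_proba = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_points(y_true, y_pred_proba)
auc_score = trapezoid_auc(fpr, tpr)
print(fpr, tpr, thresholds, auc_score)
```

For a multiclass problem, one such curve per class (treating that class as positive and all others as negative) would give the "one dictionary per class" shape of the return value.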
- evalml.model_understanding.graphs.t_sne(X, n_components=2, perplexity=30.0, learning_rate=200.0, metric='euclidean', **kwargs)
Get the transformed output after fitting X to the embedded space using t-SNE.
- Parameters
X (np.ndarray, pd.DataFrame) – Data to be transformed. Must be numeric.
n_components (int, optional) – Dimension of the embedded space.
perplexity (float, optional) – Related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50.
learning_rate (float, optional) – Usually in the range [10.0, 1000.0]. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help.
metric (str, optional) – The metric to use when calculating distance between instances in a feature array.
kwargs – Arbitrary keyword arguments.
- Returns
TSNE output.
- Return type
np.ndarray (n_samples, n_components)
- Raises
ValueError – If specified parameters are not valid values.
- evalml.model_understanding.graphs.visualize_decision_tree(estimator, max_depth=None, rotate=False, filled=False, filepath=None)
Generate an image visualizing the decision tree.
- Parameters
estimator (ComponentBase) – A fitted DecisionTree-based estimator.
max_depth (int, optional) – The depth to which the tree should be displayed. If set to None (as by default), the tree is fully generated.
rotate (bool, optional) – Orient tree left to right rather than top-down.
filled (bool, optional) – Paint nodes to indicate majority class for classification, extremity of values for regression, or purity of node for multi-output.
filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.
- Returns
DOT object that can be directly displayed in Jupyter notebooks.
- Return type
graphviz.Source
- Raises
ValueError – If estimator is not a decision tree-based estimator.
NotFittedError – If estimator is not yet fitted.