Model Understanding

Simply examining a model’s performance metrics is not enough to select a model and promote it for use in a production setting. While developing an ML algorithm, it is important to understand how the model behaves on the data, to examine the key factors influencing its predictions, and to consider where it may be deficient. What “success” means for an ML project depends first and foremost on the user’s domain expertise.

EvalML includes a variety of tools for understanding models, from graphing utilities to methods for explaining predictions.

** Graphing methods in Jupyter Notebook and JupyterLab require ipywidgets to be installed.

** Graphing in JupyterLab also requires jupyterlab-plotly. To install it, make sure you have npm installed.

Graphing Utilities

First, let’s train a pipeline on some data.

[1]:
import evalml

class DTBinaryClassificationPipeline(evalml.pipelines.BinaryClassificationPipeline):
    component_graph = ['Simple Imputer', 'Decision Tree Classifier']

X, y = evalml.demos.load_breast_cancer()

pipeline_dt = DTBinaryClassificationPipeline({})
pipeline_dt.fit(X, y)
[1]:
DTBinaryClassificationPipeline(parameters={'Simple Imputer':{'impute_strategy': 'most_frequent', 'fill_value': None}, 'Decision Tree Classifier':{'criterion': 'gini', 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0},})

Tree Visualization

We can visualize the structure of the Decision Tree that was fit to that data, and save it if necessary.

[2]:
from evalml.model_understanding.graphs import visualize_decision_tree

visualize_decision_tree(pipeline_dt.estimator, max_depth=2, rotate=False, filled=True, filepath=None)
[2]:
(Figure: visualization of the fitted decision tree.)

Let’s replace the Decision Tree Classifier with a Random Forest Classifier.

[3]:
class RFBinaryClassificationPipeline(evalml.pipelines.BinaryClassificationPipeline):
    component_graph = ['Simple Imputer', 'Random Forest Classifier']

pipeline = RFBinaryClassificationPipeline({})
pipeline.fit(X, y)
print(pipeline.score(X, y, objectives=['log loss binary']))
OrderedDict([('Log Loss Binary', 0.038403828027876195)])

Feature Importance

We can get the importance associated with each feature of the resulting pipeline.

[4]:
pipeline.feature_importance
[4]:
    feature                  importance
0   worst perimeter          0.176488
1   worst concave points     0.125260
2   worst radius             0.124161
3   mean concave points      0.086443
4   worst area               0.072465
5   mean concavity           0.072320
6   mean perimeter           0.056685
7   mean area                0.049599
8   area error               0.037229
9   worst concavity          0.028181
10  mean radius              0.023294
11  radius error             0.019457
12  worst texture            0.014990
13  perimeter error          0.014103
14  mean texture             0.013618
15  worst compactness        0.011310
16  worst smoothness         0.011139
17  worst fractal dimension  0.008118
18  worst symmetry           0.007818
19  mean smoothness          0.006152
20  concave points error     0.005887
21  fractal dimension error  0.005059
22  concavity error          0.004510
23  smoothness error         0.004493
24  texture error            0.004476
25  mean compactness         0.004050
26  compactness error        0.003559
27  mean symmetry            0.003243
28  symmetry error           0.003124
29  mean fractal dimension   0.002768

We can also create a bar plot of the feature importances.

[5]:
pipeline.graph_feature_importance()

Permutation Importance

We can also compute and plot the permutation importance of the pipeline.

[6]:
from evalml.model_understanding.graphs import calculate_permutation_importance
calculate_permutation_importance(pipeline, X, y, 'log loss binary')
[6]:
    feature                  importance
0   worst perimeter          0.083152
1   worst radius             0.078690
2   worst area               0.071237
3   worst concave points     0.071188
4   mean concave points      0.043834
5   worst concavity          0.040660
6   mean concavity           0.039079
7   area error               0.037576
8   mean area                0.027190
9   mean perimeter           0.026886
10  worst texture            0.017269
11  mean texture             0.013273
12  perimeter error          0.011904
13  mean radius              0.011215
14  radius error             0.011004
15  worst compactness        0.009072
16  worst smoothness         0.008203
17  mean smoothness          0.005717
18  worst symmetry           0.004561
19  worst fractal dimension  0.004273
20  concavity error          0.004138
21  compactness error        0.003855
22  concave points error     0.003221
23  mean compactness         0.003207
24  smoothness error         0.002949
25  fractal dimension error  0.002712
26  texture error            0.002541
27  mean fractal dimension   0.002305
28  symmetry error           0.002077
29  mean symmetry            0.001675
[7]:
from evalml.model_understanding.graphs import graph_permutation_importance
graph_permutation_importance(pipeline, X, y, 'log loss binary')
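
As a rough illustration of what permutation importance measures, the sketch below shuffles one feature column at a time and records how much a score degrades. It is a pure-Python toy and not EvalML's implementation: `toy_model`, `accuracy`, `permutation_importance`, and the data are all hypothetical, and accuracy stands in for log loss to keep the example short.

```python
import random

def toy_model(row):
    # Hypothetical fixed "model": predicts 1 when feature 0 exceeds 0.5.
    # Feature 1 is ignored entirely, so its importance should be zero.
    return 1 if row[0] > 0.5 else 0

def accuracy(rows, labels):
    return sum(toy_model(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, n_repeats=20, seed=0):
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only this column, leaving every other column intact.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(permuted, labels))
        # Importance is the average drop in score caused by shuffling.
        importances.append(sum(drops) / n_repeats)
    return importances

rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
        [0.6, 3.0], [0.7, 5.0], [0.8, 1.0], [0.9, 2.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(rows, labels)
print(importances)
```

Because the toy model never reads the second feature, shuffling it cannot change the score, while shuffling the first feature breaks the link between the input and the label.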

Partial Dependence Plots

We can calculate the one-way partial dependence plots for a feature.

[8]:
from evalml.model_understanding.graphs import partial_dependence
partial_dependence(pipeline, X, features='mean radius')
[8]:
     feature_values   partial_dependence   class_label
0    9.498540         0.371141             malignant
1    9.610488         0.371141             malignant
2    9.722436         0.371141             malignant
3    9.834384         0.371141             malignant
4    9.946332         0.371141             malignant
...  ...              ...                  ...
95   20.133608        0.399560             malignant
96   20.245556        0.399560             malignant
97   20.357504        0.399560             malignant
98   20.469452        0.399560             malignant
99   20.581400        0.399560             malignant

100 rows × 3 columns

[9]:
from evalml.model_understanding.graphs import graph_partial_dependence
graph_partial_dependence(pipeline, X, features='mean radius')
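
To make the idea behind a one-way partial dependence value concrete, here is a minimal sketch: for each grid value, the chosen feature is overwritten in every row and the model's predictions are averaged. The `toy_predict_proba` model, the `partial_dependence_1d` helper, and the data are hypothetical and only illustrate the general technique, not how EvalML computes it internally.

```python
def toy_predict_proba(row):
    # Hypothetical model: probability rises with feature 0,
    # with a smaller contribution from feature 1.
    return min(1.0, max(0.0, 0.1 * row[0] + 0.05 * row[1]))

def partial_dependence_1d(rows, feature_index, grid):
    averages = []
    for value in grid:
        total = 0.0
        for row in rows:
            # Overwrite the feature of interest, keep the other features as observed.
            modified = list(row)
            modified[feature_index] = value
            total += toy_predict_proba(modified)
        averages.append(total / len(rows))
    return averages

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grid = [0.0, 2.0, 4.0]
pd_values = partial_dependence_1d(rows, 0, grid)
print(list(zip(grid, pd_values)))
```

Each output pair corresponds to one row of a table like the one above: a feature value and the average prediction obtained when every sample takes that value.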

Two-way partial dependence plots are also possible and invoke the same API.

[10]:
partial_dependence(pipeline, X, features=('worst perimeter', 'worst radius'), grid_resolution=10)
[10]:
10.5072 12.193377777777776 13.879555555555555 15.565733333333334 17.251911111111113 18.938088888888892 20.624266666666667 22.310444444444443 23.99662222222222 25.6828 class_label
67.733600 0.264908 0.267211 0.274328 0.286943 0.405865 0.442701 0.444406 0.444406 0.444406 0.444406 malignant
79.363867 0.265840 0.268142 0.275260 0.287875 0.405865 0.442701 0.444406 0.444406 0.444406 0.444406 malignant
90.994133 0.273397 0.275699 0.282817 0.295432 0.411805 0.448641 0.450346 0.450346 0.450346 0.450346 malignant
102.624400 0.298379 0.300681 0.307799 0.323472 0.436371 0.473207 0.474911 0.474911 0.474911 0.474911 malignant
114.254667 0.395976 0.398278 0.404798 0.417739 0.530867 0.567702 0.571516 0.571797 0.571797 0.571797 malignant
125.884933 0.426266 0.428569 0.435089 0.450433 0.556594 0.593430 0.597244 0.597525 0.597525 0.597525 malignant
137.515200 0.442004 0.444307 0.450827 0.466171 0.574301 0.611137 0.614950 0.615232 0.615232 0.615232 malignant
149.145467 0.442004 0.444307 0.450827 0.466171 0.574301 0.611137 0.614950 0.615232 0.615232 0.615232 malignant
160.775733 0.442004 0.444307 0.450827 0.466171 0.574301 0.611137 0.614950 0.615232 0.615232 0.615232 malignant
172.406000 0.442004 0.444307 0.450827 0.466171 0.574301 0.611137 0.614950 0.615232 0.615232 0.615232 malignant
[11]:
graph_partial_dependence(pipeline, X, features=('worst perimeter', 'worst radius'), grid_resolution=10)

Confusion Matrix

For binary or multiclass classification, we can view a confusion matrix of the classifier’s predictions. In the DataFrame output of confusion_matrix(), the column headers represent the predicted labels while the row headers represent the actual labels.

[12]:
from evalml.model_understanding.graphs import confusion_matrix
y_pred = pipeline.predict(X)
confusion_matrix(y, y_pred)
[12]:
            benign     malignant
benign      1.000000   0.000000
malignant   0.009434   0.990566
[13]:
from evalml.model_understanding.graphs import graph_confusion_matrix
y_pred = pipeline.predict(X)
graph_confusion_matrix(y, y_pred)
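
The values in the output above sum to 1 across each row, which is consistent with normalizing the counts by the number of samples of each actual label. The sketch below reproduces that kind of row normalization in plain Python. The helper name `row_normalized_confusion` and the small label lists are hypothetical.

```python
from collections import Counter

def row_normalized_confusion(y_true, y_pred, labels):
    # Count each (actual, predicted) pair once.
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for actual in labels:
        row_total = sum(counts[(actual, p)] for p in labels)
        # Divide by the number of samples that truly belong to this class.
        matrix[actual] = {
            p: (counts[(actual, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix

y_true = ["benign", "benign", "malignant", "malignant", "malignant", "malignant"]
y_pred = ["benign", "benign", "malignant", "malignant", "malignant", "benign"]
cm = row_normalized_confusion(y_true, y_pred, ["benign", "malignant"])
for actual, row in cm.items():
    print(actual, row)
```

Read this way, an off-diagonal entry in a row is the fraction of that actual class which was assigned to a different predicted class.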

Precision-Recall Curve

For binary classification, we can view the precision-recall curve of the pipeline.

[14]:
from evalml.model_understanding.graphs import graph_precision_recall_curve
# get the predicted probabilities associated with the "true" label
import woodwork as ww
y_encoded = y.to_series().map({'benign': 0, 'malignant': 1})
y_encoded = ww.DataColumn(y_encoded)
y_pred_proba = pipeline.predict_proba(X)["malignant"]
graph_precision_recall_curve(y_encoded, y_pred_proba)
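
Each point on a precision-recall curve comes from thresholding the predicted probabilities and counting true positives, false positives, and false negatives. The following sketch computes a few such points by hand. The helper `precision_recall_at`, the toy labels, and the convention of reporting a precision of 1.0 when nothing is predicted positive are assumptions made for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    predicted = [s >= threshold for s in scores]
    tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
    fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
    fn = sum((not p) and t == 1 for p, t in zip(predicted, y_true))
    # Assumed convention: precision is 1.0 when no positives are predicted.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = [(t, *precision_recall_at(y_true, scores, t)) for t in (0.0, 0.35, 0.5, 0.9)]
for threshold, precision, recall in curve:
    print(threshold, precision, recall)
```

Raising the threshold generally trades recall for precision, which is the trade-off the plotted curve summarizes.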

ROC Curve

For binary and multiclass classification, we can view the Receiver Operating Characteristic (ROC) curve of the pipeline.

[15]:
from evalml.model_understanding.graphs import graph_roc_curve
# get the predicted probabilities associated with the "malignant" label
y_pred_proba = pipeline.predict_proba(X)["malignant"]
graph_roc_curve(y_encoded, y_pred_proba)

The ROC curve can also be generated for multiclass classification problems. For multiclass problems, the graph will show a one-vs-many ROC curve for each class.

[16]:
class RFMulticlassClassificationPipeline(evalml.pipelines.MulticlassClassificationPipeline):
    component_graph = ['Simple Imputer', 'Random Forest Classifier']

X_multi, y_multi = evalml.demos.load_wine()

pipeline_multi = RFMulticlassClassificationPipeline({})
pipeline_multi.fit(X_multi, y_multi)

y_pred_proba = pipeline_multi.predict_proba(X_multi)
graph_roc_curve(y_multi, y_pred_proba)
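
The per-class curves for a multiclass problem can be understood as treating each class in turn as the positive class and all other classes as negative. Below is a minimal sketch of that idea with hypothetical three-class data. The helper `roc_point` and the chosen thresholds are illustrative only and say nothing about how graph_roc_curve selects its thresholds.

```python
def roc_point(binary_truth, scores, threshold):
    # Returns (false positive rate, true positive rate) at one threshold.
    tp = sum(s >= threshold and t for s, t in zip(scores, binary_truth))
    fp = sum(s >= threshold and not t for s, t in zip(scores, binary_truth))
    positives = sum(binary_truth)
    negatives = len(binary_truth) - positives
    return fp / negatives, tp / positives

y_true = ["a", "b", "c", "a", "b", "c"]
# Hypothetical predicted probabilities, one list per class.
proba = {
    "a": [0.7, 0.2, 0.1, 0.6, 0.3, 0.2],
    "b": [0.2, 0.6, 0.3, 0.3, 0.4, 0.3],
    "c": [0.1, 0.2, 0.6, 0.1, 0.3, 0.5],
}

curves = {}
for cls, scores in proba.items():
    # One class is "positive", every other class is "negative".
    truth = [label == cls for label in y_true]
    curves[cls] = [roc_point(truth, scores, t) for t in (1.1, 0.5, 0.0)]

for cls, points in curves.items():
    print(cls, points)
```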

Binary Objective Score vs. Threshold Graph

Some binary classification objectives (objectives that have score_needs_proba set to False) are sensitive to a decision threshold. For those objectives, we can obtain and graph the scores for thresholds from zero to one, calculated at evenly-spaced intervals determined by steps.

[17]:
from evalml.model_understanding.graphs import binary_objective_vs_threshold
binary_objective_vs_threshold(pipeline, X, y, 'f1', steps=100)
[17]:
      threshold   score
0     0.00        0.542894
1     0.01        0.750442
2     0.02        0.815385
3     0.03        0.848000
4     0.04        0.874227
...   ...         ...
96    0.96        0.854054
97    0.97        0.835165
98    0.98        0.805634
99    0.99        0.722892
100   1.00        0.000000

101 rows × 2 columns

[18]:
from evalml.model_understanding.graphs import graph_binary_objective_vs_threshold
graph_binary_objective_vs_threshold(pipeline, X, y, 'f1', steps=100)
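
The table above has 101 rows for steps=100 because both endpoints, 0 and 1, are included. The sketch below shows one way such a sweep could be computed for F1. It is an illustration under assumptions: the helpers `f1_score` and `objective_vs_threshold` are hypothetical, and the rule "predict positive when the probability is strictly greater than the threshold" is an assumption rather than a documented EvalML behavior.

```python
def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def objective_vs_threshold(y_true, proba, steps):
    rows = []
    # steps + 1 evenly spaced thresholds, including both 0 and 1.
    for i in range(steps + 1):
        threshold = i / steps
        # Assumed decision rule: strictly greater than the threshold.
        y_pred = [1 if p > threshold else 0 for p in proba]
        rows.append((threshold, f1_score(y_true, y_pred)))
    return rows

y_true = [0, 0, 1, 1]
proba = [0.1, 0.4, 0.35, 0.8]
rows = objective_vs_threshold(y_true, proba, steps=4)
for threshold, score in rows:
    print(threshold, round(score, 6))
```

At a threshold of 1.0 nothing is predicted positive, so F1 falls to zero, matching the last row of the table above.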

Predicted Vs Actual Values Graph for Regression Problems

We can also create a scatterplot comparing predicted vs actual values for regression problems. We can specify an outlier_threshold to color values differently if the absolute difference between the actual and predicted values is outside of a given threshold.

[19]:
from evalml.model_understanding.graphs import graph_prediction_vs_actual

class LinearRegressionPipeline(evalml.pipelines.RegressionPipeline):
    component_graph = ['One Hot Encoder', 'Linear Regressor']

X_regress, y_regress = evalml.demos.load_diabetes()
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X_regress, y_regress, problem_type='regression')

pipeline_regress = LinearRegressionPipeline({})
pipeline_regress.fit(X_train, y_train)

y_pred = pipeline_regress.predict(X_test)
graph_prediction_vs_actual(y_test, y_pred, outlier_threshold=50)
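
The outlier rule described above amounts to comparing the absolute residual of each point against the threshold. A minimal sketch with made-up numbers follows. The helper `flag_outliers` is hypothetical, and whether the library's comparison is strict or inclusive at exactly the threshold is not specified here.

```python
def flag_outliers(y_actual, y_predicted, outlier_threshold):
    # A point is flagged when |actual - predicted| exceeds the threshold.
    return [abs(a - p) > outlier_threshold for a, p in zip(y_actual, y_predicted)]

y_actual = [150.0, 90.0, 210.0, 60.0]
y_predicted = [140.0, 160.0, 205.0, 125.0]
flags = flag_outliers(y_actual, y_predicted, outlier_threshold=50)
print(flags)
```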

Explaining Predictions

We can explain why the model made certain predictions with the explain_predictions function. This will use the Shapley Additive Explanations (SHAP) algorithm to identify the top features that explain the predicted value.

This function can explain both classification and regression models - all you need to do is provide the pipeline, the input features, and a list of rows corresponding to the indices of the input features you want to explain. The function will return a table that you can print summarizing the top 3 most positive and negative contributing features to the predicted value.

In the example below, we explain the prediction for the data point at index 3 (the fourth row, since indexing starts at zero). We see that the worst concave points feature increased the estimated probability that the tumor is malignant by 0.20 (20 percentage points), while the worst radius feature decreased that probability by 0.05 (5 percentage points).

[20]:
from evalml.model_understanding.prediction_explanations import explain_predictions

table = explain_predictions(pipeline=pipeline, input_features=X, y=None, indices_to_explain=[3],
                           top_k_features=6, include_shap_values=True)
print(table)
RFBinary Classification Pipeline

{'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None}, 'Random Forest Classifier': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}

        1 of 1

                    Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                ==============================================================================
                worst concave points       0.26                    ++                  0.20
                mean concave points        0.11                    +                   0.11
                   mean concavity          0.24                    +                   0.08
                  worst concavity          0.69                    +                   0.05
                  worst perimeter          98.87                   -                  -0.05
                    worst radius           14.91                   -                  -0.05



The interpretation of the table is the same for regression problems - but the SHAP value now corresponds to the change in the estimated value of the dependent variable rather than a change in probability. For multiclass classification problems, a table will be output for each possible class.

This functionality is currently not supported for XGBoost models or CatBoost multiclass classifiers.
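
To build intuition for the SHAP values reported in these tables, the sketch below computes exact Shapley values for a tiny hypothetical model by averaging each feature's marginal contribution over every ordering of the features. This brute-force enumeration is only practical for a handful of features and is not the algorithm the SHAP library uses. The `model`, `shapley_values`, `instance`, and `baseline` names are illustrative assumptions.

```python
from itertools import permutations

def model(x):
    # Hypothetical model: a linear term in feature 0 plus an
    # interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2]

def shapley_values(model, instance, baseline):
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        # Start from the baseline and switch features on one at a time.
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = instance[i]
            value = model(current)
            totals[i] += value - previous  # marginal contribution of feature i
            previous = value
    return [t / len(orderings) for t in totals]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, instance, baseline)
print(phi)
print(sum(phi), model(instance) - model(baseline))
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is why positive and negative SHAP values can be read as pushing the prediction up or down.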

Below is an example of how you would explain three predictions with explain_predictions.

[21]:
from evalml.model_understanding.prediction_explanations import explain_predictions

report = explain_predictions(pipeline=pipeline, input_features=X, y=y, indices_to_explain=[0, 4, 9], include_shap_values=True,
                            output_format='text')
print(report)
RFBinary Classification Pipeline

{'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None}, 'Random Forest Classifier': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}

        1 of 3

                    Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                ==============================================================================
                worst concave points       0.27                    +                   0.09
                  worst perimeter         184.60                   +                   0.09
                    worst radius           25.38                   +                   0.08


        2 of 3

                    Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                ==============================================================================
                  worst perimeter         152.20                   +                   0.11
                    worst radius           22.54                   +                   0.09
                worst concave points       0.16                    +                   0.08


        3 of 3

                    Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                ==============================================================================
                worst concave points       0.22                    ++                  0.20
                mean concave points        0.09                    +                   0.11
                   mean concavity          0.23                    +                   0.08



Explaining Best and Worst Predictions

When debugging machine learning models, it is often useful to analyze the best and worst predictions the model made. The explain_predictions_best_worst function can help us with this.

This function will display the output of explain_predictions for the best 2 and worst 2 predictions. By default, the best and worst predictions are determined by the absolute error for regression problems and cross entropy for classification problems.

We can specify our own ranking function by passing in a function to the metric parameter. This function will be called on y_true and y_pred. By convention, lower scores are better.

At the top of each table, we can see the predicted probabilities, target value, error, and row index for that prediction. For a regression problem, we would see the predicted value instead of predicted probabilities.

[22]:
from evalml.model_understanding.prediction_explanations import explain_predictions_best_worst

report = explain_predictions_best_worst(pipeline=pipeline, input_features=X, y_true=y,
                                        include_shap_values=True, top_k_features=6, num_to_explain=2)

print(report)
RFBinary Classification Pipeline

{'Simple Imputer': {'impute_strategy': 'most_frequent', 'fill_value': None}, 'Random Forest Classifier': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}

        Best 1 of 2

                Predicted Probabilities: [benign: 0.0, malignant: 1.0]
                Predicted Value: malignant
                Target Value: malignant
                Cross Entropy: 0.0
                Index ID: 168

                    Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                ==============================================================================
                  worst perimeter         155.30                   +                   0.10
                    worst radius           23.14                   +                   0.08
                worst concave points       0.17                    +                   0.08
                     worst area           1660.00                  +                   0.06
                mean concave points        0.10                    +                   0.05
                     area error           122.30                   +                   0.04


        Best 2 of 2

                Predicted Probabilities: [benign: 0.0, malignant: 1.0]
                Predicted Value: malignant
                Target Value: malignant
                Cross Entropy: 0.0
                Index ID: 564

                    Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                ==============================================================================
                  worst perimeter         166.10                   +                   0.10
                    worst radius           25.45                   +                   0.08
                worst concave points       0.22                    +                   0.08
                     worst area           2027.00                  +                   0.06
                mean concave points        0.14                    +                   0.05
                   mean concavity          0.24                    +                   0.05


        Worst 1 of 2

                Predicted Probabilities: [benign: 0.552, malignant: 0.448]
                Predicted Value: benign
                Target Value: malignant
                Cross Entropy: 0.802
                Index ID: 40

                   Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                =============================================================================
                 smoothness error         0.00                    +                   0.04
                   mean texture           21.58                   +                   0.03
                   worst texture          30.25                   +                   0.02
                    worst area           787.90                   +                   0.02
                   worst radius           15.93                   -                  -0.03
                mean concave points       0.02                    -                  -0.03


        Worst 2 of 2

                Predicted Probabilities: [benign: 0.788, malignant: 0.212]
                Predicted Value: benign
                Target Value: malignant
                Cross Entropy: 1.55
                Index ID: 135

                    Feature Name       Feature Value   Contribution to Prediction   SHAP Value
                ==============================================================================
                   worst texture           33.37                   +                   0.05
                    mean texture           22.47                   +                   0.03
                mean concave points        0.03                    -                  -0.03
                worst concave points       0.09                    -                  -0.04
                    worst radius           14.49                   -                  -0.05
                  worst perimeter          92.04                   -                  -0.06



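We can also use a custom metric, such as hinge loss, to select the best and worst predictions, as in this example:

import numpy as np

def hinge_loss(y_true, y_pred_proba):
    # Clip the positive-class probabilities away from 0 and 1 so the log-odds stay finite.
    probabilities = np.clip(y_pred_proba.iloc[:, 1], 0.001, 0.999)
    # Work on a copy so the caller's labels are not modified, then map the 0 class to -1.
    y_true = y_true.copy()
    y_true[y_true == 0] = -1

    # Per-row hinge loss computed on the log-odds; lower scores are better.
    return np.clip(1 - y_true * np.log(probabilities / (1 - probabilities)), a_min=0, a_max=None)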

report = explain_predictions_best_worst(pipeline=pipeline, input_features=X, y_true=y,
                                        include_shap_values=True, num_to_explain=5, metric=hinge_loss)

print(report)
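
The selection of best and worst predictions can be pictured as scoring every row with a per-row error and then sorting. The sketch below does this with cross entropy on a few hypothetical rows. The helpers `per_row_cross_entropy` and `best_and_worst`, the list-of-dicts probability layout, and the clipping constant are assumptions for illustration and do not mirror EvalML's internal data structures.

```python
import math

def per_row_cross_entropy(y_true, y_pred_proba):
    # y_pred_proba: list of dicts mapping class label -> predicted probability.
    eps = 1e-15  # avoid log(0)
    return [-math.log(max(p[t], eps)) for t, p in zip(y_true, y_pred_proba)]

def best_and_worst(errors, num_to_explain):
    # Lower error is better, so the best rows come first after sorting.
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    return order[:num_to_explain], order[::-1][:num_to_explain]

y_true = ["malignant", "malignant", "benign", "malignant"]
y_pred_proba = [
    {"benign": 0.0001, "malignant": 0.9999},
    {"benign": 0.788, "malignant": 0.212},
    {"benign": 0.9, "malignant": 0.1},
    {"benign": 0.552, "malignant": 0.448},
]

errors = per_row_cross_entropy(y_true, y_pred_proba)
best, worst = best_and_worst(errors, num_to_explain=2)
print("best rows:", best)
print("worst rows:", worst)
```

A custom metric only needs to follow the same contract: return one score per row, with lower values meaning better predictions.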

Changing Output Formats

Instead of getting the prediction explanations as text, you can get the report as a Python dictionary or pandas DataFrame. All you have to do is pass output_format="dict" or output_format="dataframe" to explain_prediction, explain_predictions, or explain_predictions_best_worst.

Single prediction as a dictionary

[23]:
import json
single_prediction_report = explain_predictions(pipeline=pipeline, input_features=X, indices_to_explain=[3],
                                              y=y, top_k_features=6, include_shap_values=True,
                                              output_format="dict")
print(json.dumps(single_prediction_report, indent=2))
{
  "explanations": [
    {
      "explanations": [
        {
          "feature_names": [
            "worst concave points",
            "mean concave points",
            "mean concavity",
            "worst concavity",
            "worst perimeter",
            "worst radius"
          ],
          "feature_values": [
            0.2575,
            0.1052,
            0.2414,
            0.6869,
            98.87,
            14.91
          ],
          "qualitative_explanation": [
            "++",
            "+",
            "+",
            "+",
            "-",
            "-"
          ],
          "quantitative_explanation": [
            0.19966729417702012,
            0.10648831456429969,
            0.07869244977813485,
            0.05150874542350735,
            -0.04930428857229847,
            -0.05034083333027343
          ],
          "drill_down": {},
          "class_name": "malignant"
        }
      ]
    }
  ]
}

Single prediction as a dataframe

[24]:
single_prediction_report = explain_predictions(pipeline=pipeline, input_features=X, indices_to_explain=[3],
                                              y=y, top_k_features=6, include_shap_values=True,
                                              output_format="dataframe")
single_prediction_report
[24]:
    feature_names          feature_values   qualitative_explanation   quantitative_explanation   class_name   prediction_number
0   worst concave points   0.2575           ++                        0.199667                   malignant    0
1   mean concave points    0.1052           +                         0.106488                   malignant    0
2   mean concavity         0.2414           +                         0.078692                   malignant    0
3   worst concavity        0.6869           +                         0.051509                   malignant    0
4   worst perimeter        98.8700          -                         -0.049304                  malignant    0
5   worst radius           14.9100          -                         -0.050341                  malignant    0

Best and worst predictions as a dictionary

[25]:
report = explain_predictions_best_worst(pipeline=pipeline, input_features=X, y_true=y,
                                        num_to_explain=1, top_k_features=6,
                                        include_shap_values=True, output_format="dict")
print(json.dumps(report, indent=2))
{
  "explanations": [
    {
      "rank": {
        "prefix": "best",
        "index": 1
      },
      "predicted_values": {
        "probabilities": {
          "benign": 0.0,
          "malignant": 1.0
        },
        "predicted_value": "malignant",
        "target_value": "malignant",
        "error_name": "Cross Entropy",
        "error_value": 9.95074382629983e-05,
        "index_id": 168
      },
      "explanations": [
        {
          "feature_names": [
            "worst perimeter",
            "worst radius",
            "worst concave points",
            "worst area",
            "mean concave points",
            "area error"
          ],
          "feature_values": [
            155.3,
            23.14,
            0.1721,
            1660.0,
            0.1043,
            122.3
          ],
          "qualitative_explanation": [
            "+",
            "+",
            "+",
            "+",
            "+",
            "+"
          ],
          "quantitative_explanation": [
            0.09988982304983156,
            0.08240174808629956,
            0.07868368954615064,
            0.06242860386204596,
            0.051970789425386396,
            0.04459155806887927
          ],
          "drill_down": {},
          "class_name": "malignant"
        }
      ]
    },
    {
      "rank": {
        "prefix": "worst",
        "index": 1
      },
      "predicted_values": {
        "probabilities": {
          "benign": 0.788,
          "malignant": 0.212
        },
        "predicted_value": "benign",
        "target_value": "malignant",
        "error_name": "Cross Entropy",
        "error_value": 1.5499050281608746,
        "index_id": 135
      },
      "explanations": [
        {
          "feature_names": [
            "worst texture",
            "mean texture",
            "mean concave points",
            "worst concave points",
            "worst radius",
            "worst perimeter"
          ],
          "feature_values": [
            33.37,
            22.47,
            0.02704,
            0.09331,
            14.49,
            92.04
          ],
          "qualitative_explanation": [
            "+",
            "+",
            "-",
            "-",
            "-",
            "-"
          ],
          "quantitative_explanation": [
            0.05245422607466413,
            0.03035933540832274,
            -0.03461744299818247,
            -0.04174884967530769,
            -0.0491285663898271,
            -0.05666940833106337
          ],
          "drill_down": {},
          "class_name": "malignant"
        }
      ]
    }
  ]
}
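
Once the report is a dictionary, it can be post-processed with ordinary Python. As a sketch, the snippet below walks a trimmed copy of the structure printed above and collects one summary row per explained prediction, keeping the feature with the largest absolute SHAP value. The helper `summarize_report` is hypothetical, and it assumes the best/worst layout shown above, with "rank" and "predicted_values" keys and SHAP values included.

```python
def summarize_report(report):
    """Collect one summary row per explained prediction from a dict-format report."""
    rows = []
    for entry in report["explanations"]:
        rank = entry["rank"]
        predicted = entry["predicted_values"]
        for explanation in entry["explanations"]:
            pairs = zip(explanation["feature_names"],
                        explanation["quantitative_explanation"])
            # Keep the feature whose SHAP value has the largest magnitude.
            name, shap_value = max(pairs, key=lambda pair: abs(pair[1]))
            rows.append({
                "rank": f"{rank['prefix']} {rank['index']}",
                "index_id": predicted["index_id"],
                "error_value": predicted["error_value"],
                "class_name": explanation["class_name"],
                "strongest_feature": name,
                "shap_value": shap_value,
            })
    return rows

# Trimmed copy of the dictionary shown above (only the keys used here).
report = {
    "explanations": [
        {
            "rank": {"prefix": "best", "index": 1},
            "predicted_values": {"error_value": 9.95074382629983e-05, "index_id": 168},
            "explanations": [{
                "feature_names": ["worst perimeter", "worst radius"],
                "quantitative_explanation": [0.09988982304983156, 0.08240174808629956],
                "class_name": "malignant",
            }],
        },
        {
            "rank": {"prefix": "worst", "index": 1},
            "predicted_values": {"error_value": 1.5499050281608746, "index_id": 135},
            "explanations": [{
                "feature_names": ["worst texture", "worst perimeter"],
                "quantitative_explanation": [0.05245422607466413, -0.05666940833106337],
                "class_name": "malignant",
            }],
        },
    ]
}

summary = summarize_report(report)
for row in summary:
    print(row)
```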

Best and worst predictions as a dataframe

[26]:
report = explain_predictions_best_worst(pipeline=pipeline, input_features=X, y_true=y,
                                        num_to_explain=1, top_k_features=6,
                                        include_shap_values=True, output_format="dataframe")
report
[26]:
feature_names feature_values qualitative_explanation quantitative_explanation class_name label_benign_probability label_malignant_probability predicted_value target_value error_name error_value index_id rank prefix
0 worst perimeter 155.30000 + 0.099890 malignant 0.000 1.000 malignant malignant Cross Entropy 0.000100 168 1 best
1 worst radius 23.14000 + 0.082402 malignant 0.000 1.000 malignant malignant Cross Entropy 0.000100 168 1 best
2 worst concave points 0.17210 + 0.078684 malignant 0.000 1.000 malignant malignant Cross Entropy 0.000100 168 1 best
3 worst area 1660.00000 + 0.062429 malignant 0.000 1.000 malignant malignant Cross Entropy 0.000100 168 1 best
4 mean concave points 0.10430 + 0.051971 malignant 0.000 1.000 malignant malignant Cross Entropy 0.000100 168 1 best
5 area error 122.30000 + 0.044592 malignant 0.000 1.000 malignant malignant Cross Entropy 0.000100 168 1 best
6 worst texture 33.37000 + 0.052454 malignant 0.788 0.212 benign malignant Cross Entropy 1.549905 135 1 worst
7 mean texture 22.47000 + 0.030359 malignant 0.788 0.212 benign malignant Cross Entropy 1.549905 135 1 worst
8 mean concave points 0.02704 - -0.034617 malignant 0.788 0.212 benign malignant Cross Entropy 1.549905 135 1 worst
9 worst concave points 0.09331 - -0.041749 malignant 0.788 0.212 benign malignant Cross Entropy 1.549905 135 1 worst
10 worst radius 14.49000 - -0.049129 malignant 0.788 0.212 benign malignant Cross Entropy 1.549905 135 1 worst
11 worst perimeter 92.04000 - -0.056669 malignant 0.788 0.212 benign malignant Cross Entropy 1.549905 135 1 worst