Model Understanding#

Examining a model’s performance metrics alone is not enough to select a model and promote it for use in a production setting. While developing an ML algorithm, it is important to understand how the model behaves on the data, to examine the key factors influencing its predictions, and to consider where it may be deficient. What “success” means for an ML project depends first and foremost on the user’s domain expertise.

EvalML includes a variety of tools for understanding models, from graphing utilities to methods for explaining predictions.

** Graphing methods on Jupyter Notebook and Jupyter Lab require ipywidgets to be installed.

** Graphing on Jupyter Lab also requires jupyterlab-plotly. To install it, make sure you have npm installed.

Explaining Feature Influence#

The EvalML package offers a variety of methods for understanding which features in a dataset have an impact on the output of the model. We can investigate this either through feature importance or through permutation importance, and leverage either in generating more readable explanations.

First, let’s train a pipeline on some data.

[1]:

import evalml
from evalml.pipelines import BinaryClassificationPipeline

X, y = evalml.demos.load_breast_cancer()

X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)


pipeline_binary = BinaryClassificationPipeline(
    component_graph={
        "Label Encoder": ["Label Encoder", "X", "y"],
        "Imputer": ["Imputer", "X", "Label Encoder.y"],
        "Random Forest Classifier": [
            "Random Forest Classifier",
            "Imputer.x",
            "Label Encoder.y",
        ],
    }
)
pipeline_binary.fit(X_train, y_train)
print(pipeline_binary.score(X_holdout, y_holdout, objectives=["log loss binary"]))

         Number of Features
Numeric                  30

Number of training examples: 569
Targets
benign       62.74%
malignant    37.26%
Name: count, dtype: object

/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/main/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(

OrderedDict([('Log Loss Binary', 0.1686746297113362)])

Feature Importance#

We can get the importance associated with each feature of the resulting pipeline:

[2]:

pipeline_binary.feature_importance

[2]:

	feature	importance
0	mean concave points	0.138857
1	worst perimeter	0.137780
2	worst concave points	0.117782
3	worst radius	0.100584
4	mean concavity	0.086402
5	worst area	0.072027
6	mean perimeter	0.046500
7	worst concavity	0.043408
8	mean radius	0.037664
9	mean area	0.033683
10	radius error	0.025036
11	area error	0.019324
12	worst texture	0.014754
13	worst compactness	0.014462
14	mean texture	0.013856
15	worst smoothness	0.013710
16	worst symmetry	0.011395
17	perimeter error	0.010284
18	mean compactness	0.008162
19	mean smoothness	0.008154
20	worst fractal dimension	0.007034
21	fractal dimension error	0.005502
22	compactness error	0.004953
23	smoothness error	0.004728
24	texture error	0.004384
25	symmetry error	0.004250
26	mean fractal dimension	0.004164
27	concavity error	0.004089
28	mean symmetry	0.003997
29	concave points error	0.003076

We can also create a bar plot of the feature importances:

[3]:

pipeline_binary.graph_feature_importance()

If we have a linear model, we can also view feature importance by simply inspecting the coefficients of the model.

[4]:

from evalml.model_understanding import get_linear_coefficients

pipeline_linear = BinaryClassificationPipeline(
    component_graph={
        "Label Encoder": ["Label Encoder", "X", "y"],
        "Imputer": ["Imputer", "X", "Label Encoder.y"],
        "Logistic Regression Classifier": [
            "Logistic Regression Classifier",
            "Imputer.x",
            "Label Encoder.y",
        ],
    }
)
pipeline_linear.fit(X_train, y_train)

get_linear_coefficients(pipeline_linear.estimator, features=X.columns)

/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/main/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

[4]:

Intercept                 -0.339181
worst radius              -1.777283
mean radius               -1.674112
texture error             -0.740383
perimeter error           -0.288266
mean texture              -0.081338
radius error              -0.076170
mean perimeter            -0.069128
mean area                  0.002720
fractal dimension error    0.005759
smoothness error           0.006098
symmetry error             0.019005
mean fractal dimension     0.020053
worst area                 0.021615
concave points error       0.022536
compactness error          0.058227
mean smoothness            0.073213
concavity error            0.084693
mean symmetry              0.086924
worst fractal dimension    0.098952
area error                 0.115528
worst smoothness           0.126151
mean concave points        0.183110
worst texture              0.258570
worst symmetry             0.274830
worst perimeter            0.296383
mean compactness           0.308766
worst concave points       0.348138
mean concavity             0.423376
worst compactness          0.945473
worst concavity            1.189651
dtype: float64

Permutation Importance#

We can also compute and plot the permutation importance of the pipeline.

[5]:

from evalml.model_understanding import calculate_permutation_importance

calculate_permutation_importance(
    pipeline_binary, X_holdout, y_holdout, "log loss binary"
)

[5]:

	feature	importance
0	worst perimeter	0.063657
1	worst area	0.045759
2	worst radius	0.041926
3	mean concave points	0.029325
4	worst concave points	0.021045
5	worst concavity	0.010105
6	worst texture	0.010044
7	mean texture	0.006178
8	mean symmetry	0.005857
9	mean area	0.004745
10	worst smoothness	0.003190
11	area error	0.003113
12	mean perimeter	0.002478
13	mean fractal dimension	0.001981
14	compactness error	0.001968
15	concavity error	0.001947
16	texture error	0.000291
17	smoothness error	-0.000206
18	mean smoothness	-0.000745
19	fractal dimension error	-0.000835
20	worst compactness	-0.002392
21	mean concavity	-0.003188
22	mean compactness	-0.005377
23	radius error	-0.006229
24	mean radius	-0.006870
25	worst fractal dimension	-0.007415
26	symmetry error	-0.008175
27	perimeter error	-0.008980
28	concave points error	-0.010415
29	worst symmetry	-0.018645

[6]:

from evalml.model_understanding import graph_permutation_importance

graph_permutation_importance(pipeline_binary, X_holdout, y_holdout, "log loss binary")
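To make the idea behind permutation importance concrete, here is a minimal pure-Python sketch. It is not EvalML's implementation: the helper `permutation_importance`, the toy model `toy_predict`, and the data are all hypothetical, and accuracy (higher is better) stands in for the objective. For a lower-is-better objective such as log loss, the sign of the difference would be flipped.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, score, n_repeats=5, seed=0):
    """Mean drop in score when each column of X is shuffled (hypothetical helper)."""
    rng = random.Random(seed)
    baseline = score(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving every other column untouched
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - score(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(X):
    """Toy model (assumption): uses only feature 0 and ignores feature 1."""
    return [int(row[0] > 0.5) for row in X]


data_rng = random.Random(1)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_toy = [int(row[0] > 0.5) for row in X_toy]

importances = permutation_importance(toy_predict, X_toy, y_toy, accuracy)
print(importances)
```

In this toy setup, shuffling the ignored feature leaves the score unchanged, so its importance is zero, while shuffling the feature the model relies on lowers the score. Negative values, like those in the table above, arise when shuffling a feature happens to improve the score on the evaluation data.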

Human Readable Importance#

We can generate a more human-comprehensible understanding of either the feature or permutation importance by using readable_explanation(pipeline). This picks out a subset of features that have the highest impact on the output of the model, sorting them into either “heavily” or “somewhat” influential on the model. These features are selected either by feature importance or permutation importance with a given objective. If there are any features that actively decrease the performance of the pipeline, this function highlights those and recommends removal.

Note that permutation importance runs on the original input features, while feature importance runs on the features as they were passed to the final estimator, after any preprocessing steps. The two methods can therefore highlight different features as important, and feature names may differ as well.

[7]:

from evalml.model_understanding import readable_explanation

readable_explanation(
    pipeline_binary,
    X_holdout,
    y_holdout,
    objective="log loss binary",
    importance_method="permutation",
)

Random Forest Classifier: The output as measured by log loss binary is heavily influenced by worst perimeter, and is somewhat influenced by worst area, worst radius, mean concave points, and worst concave points.
The features smoothness error, mean smoothness, fractal dimension error, worst compactness, mean concavity, mean compactness, radius error, mean radius, worst fractal dimension, symmetry error, perimeter error, concave points error, and worst symmetry detracted from model performance. We suggest removing these features.

[8]:

readable_explanation(
    pipeline_binary, importance_method="feature"
)  # feature importance doesn't require X and y

Random Forest Classifier: The output is somewhat influenced by mean concave points, worst perimeter, worst concave points, worst radius, and mean concavity.

We can adjust the number of most important features visible with the max_features argument, or modify the minimum threshold for “importance” with min_importance_threshold. However, these values will not affect any detrimental features displayed, as this function always displays all of them.

Metrics for Model Understanding#

Confusion Matrix#

For binary or multiclass classification, we can view a confusion matrix of the classifier’s predictions. In the DataFrame output of confusion_matrix(), the column headers represent the predicted labels, while the row headers represent the actual labels.

[9]:

from evalml.model_understanding.metrics import confusion_matrix

y_pred = pipeline_binary.predict(X_holdout)
confusion_matrix(y_holdout, y_pred)

[9]:

	benign	malignant
benign	0.930556	0.069444
malignant	0.023810	0.976190
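Each row in the output above sums to 1, which is consistent with normalizing over the actual labels. The following is a minimal sketch of that kind of row normalization in plain Python. The helper name `normalized_confusion_matrix` and the toy labels are hypothetical and are not part of EvalML.

```python
from collections import Counter


def normalized_confusion_matrix(y_true, y_pred, labels):
    """Row-normalized confusion matrix as nested dicts: matrix[actual][predicted]."""
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for actual in labels:
        row_total = sum(counts[(actual, predicted)] for predicted in labels)
        matrix[actual] = {
            predicted: (counts[(actual, predicted)] / row_total if row_total else 0.0)
            for predicted in labels
        }
    return matrix


# Toy data (assumption): 4 benign and 4 malignant rows, one mistake in each class
y_true_toy = ["benign"] * 4 + ["malignant"] * 4
y_pred_toy = [
    "benign", "benign", "benign", "malignant",
    "malignant", "malignant", "malignant", "benign",
]

cm = normalized_confusion_matrix(y_true_toy, y_pred_toy, ["benign", "malignant"])
for actual, row in cm.items():
    print(actual, row)
```

Read this way, the diagonal entries are the per-class recall, and the off-diagonal entries show which class the errors fall into.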

[10]:

from evalml.model_understanding.metrics import graph_confusion_matrix

y_pred = pipeline_binary.predict(X_holdout)
graph_confusion_matrix(y_holdout, y_pred)

Precision-Recall Curve#

For binary classification, we can view the precision-recall curve of the pipeline.

[11]:

from evalml.model_understanding.metrics import graph_precision_recall_curve

# get the predicted probabilities associated with the positive ("malignant") label
import woodwork as ww

y_encoded = y_holdout.ww.map({"benign": 0, "malignant": 1})
y_pred_proba = pipeline_binary.predict_proba(X_holdout)["malignant"]
graph_precision_recall_curve(y_encoded, y_pred_proba)
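As a rough illustration of what a precision-recall curve plots, the sketch below computes precision and recall at a few thresholds by hand. The helper `precision_recall_points` and the toy scores are hypothetical; a plotting utility would typically evaluate many more thresholds.

```python
def precision_recall_points(y_true, y_score, thresholds):
    """Return (threshold, precision, recall) for each threshold (hypothetical helper)."""
    points = []
    for t in thresholds:
        y_pred = [int(s >= t) for s in y_score]
        tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
        fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
        # Convention (assumption): precision is 1.0 when nothing is predicted positive
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points


# Toy labels (1 = positive class) and predicted probabilities
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

pr_points = precision_recall_points(y_true_toy, y_score_toy, [0.2, 0.5])
for threshold, precision, recall in pr_points:
    print(threshold, round(precision, 3), round(recall, 3))
```

Raising the threshold here trades recall for precision, which is the trade-off the curve visualizes.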

ROC Curve#

For binary and multiclass classification, we can view the Receiver Operating Characteristic (ROC) curve of the pipeline.

[12]:

from evalml.model_understanding.metrics import graph_roc_curve

# get the predicted probabilities associated with the "malignant" label
y_pred_proba = pipeline_binary.predict_proba(X_holdout)["malignant"]
graph_roc_curve(y_encoded, y_pred_proba)
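For intuition about what the ROC curve contains, here is a minimal sketch that computes the (false positive rate, true positive rate) points and the area under the curve by hand. The helpers `roc_points` and `trapezoid_auc` are hypothetical, and the sketch assumes binary labels encoded as 0 and 1 with both classes present.

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) pairs, sweeping the threshold from the highest score downward."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for a, s in zip(y_true, y_score) if a == 1 and s >= t)
        fp = sum(1 for a, s in zip(y_true, y_score) if a == 0 and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy labels (1 = positive class) and predicted probabilities
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true_toy, y_score_toy)
roc_auc = trapezoid_auc(curve)
print(curve)
print(roc_auc)
```

A curve that hugs the top-left corner (area close to 1) indicates that positives are mostly ranked above negatives, while the diagonal (area 0.5) corresponds to random ranking.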

The ROC curve can also be generated for multiclass classification problems. For multiclass problems, the graph will show a one-vs-many ROC curve for each class.

[13]:

from evalml.pipelines import MulticlassClassificationPipeline

X_multi, y_multi = evalml.demos.load_wine()

pipeline_multi = MulticlassClassificationPipeline(
    ["Simple Imputer", "Random Forest Classifier"]
)
pipeline_multi.fit(X_multi, y_multi)

y_pred_proba = pipeline_multi.predict_proba(X_multi)
graph_roc_curve(y_multi, y_pred_proba)

         Number of Features
Numeric                  13

Number of training examples: 178
Targets
class_1    39.89%
class_0    33.15%
class_2    26.97%
Name: count, dtype: object

Visualizations#

Binary Objective Score vs. Threshold Graph#

Some binary classification objectives (objectives that have score_needs_proba set to False) are sensitive to a decision threshold. For those objectives, we can obtain and graph the scores for thresholds from zero to one, calculated at evenly spaced intervals, with the number of intervals set by steps.

[14]:

from evalml.model_understanding.visualizations import binary_objective_vs_threshold

binary_objective_vs_threshold(pipeline_binary, X_holdout, y_holdout, "f1", steps=10)

[14]:

	threshold	score
0	0.0	0.538462
1	0.1	0.811881
2	0.2	0.891304
3	0.3	0.901099
4	0.4	0.931818
5	0.5	0.931818
6	0.6	0.941176
7	0.7	0.951220
8	0.8	0.936709
9	0.9	0.923077
10	1.0	0.000000
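The table above can be thought of as the result of a simple threshold sweep. The sketch below reproduces that idea for F1 in plain Python. The helpers `f1` and `objective_vs_threshold` are hypothetical, and the strict `>` comparison against the threshold is an assumption rather than a statement about EvalML's internals.

```python
def f1(y_true, y_pred):
    """F1 score for binary labels encoded as 0/1; returns 0.0 when undefined."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def objective_vs_threshold(y_true, y_proba, steps=10):
    """Score at steps + 1 evenly spaced thresholds between 0 and 1."""
    rows = []
    for i in range(steps + 1):
        threshold = i / steps
        # Assumption: a row is labeled positive when its probability exceeds the threshold
        y_pred = [int(p > threshold) for p in y_proba]
        rows.append((threshold, f1(y_true, y_pred)))
    return rows


# Toy labels (1 = positive class) and predicted probabilities
y_true_toy = [0, 0, 1, 1]
y_proba_toy = [0.1, 0.4, 0.35, 0.8]

sweep = objective_vs_threshold(y_true_toy, y_proba_toy, steps=4)
for threshold, score in sweep:
    print(threshold, round(score, 4))
```

As in the table above, the score drops to zero at a threshold of 1.0 because no row is predicted positive there.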

[15]:

from evalml.model_understanding.visualizations import (
    graph_binary_objective_vs_threshold,
)

graph_binary_objective_vs_threshold(
    pipeline_binary, X_holdout, y_holdout, "f1", steps=100
)

Predicted Vs Actual Values Graph for Regression Problems#

We can also create a scatterplot comparing predicted vs actual values for regression problems. We can specify an outlier_threshold to color points differently when the absolute difference between the actual and predicted values exceeds that threshold.

[16]:

from evalml.model_understanding.visualizations import graph_prediction_vs_actual
from evalml.pipelines import RegressionPipeline

X_regress, y_regress = evalml.demos.load_diabetes()
X_train_reg, X_test_reg, y_train_reg, y_test_reg = evalml.preprocessing.split_data(
    X_regress, y_regress, problem_type="regression"
)

pipeline_regress = RegressionPipeline(["One Hot Encoder", "Linear Regressor"])
pipeline_regress.fit(X_train_reg, y_train_reg)

y_pred = pipeline_regress.predict(X_test_reg)
graph_prediction_vs_actual(y_test_reg, y_pred, outlier_threshold=50)

         Number of Features
Numeric                  10

Number of training examples: 442
Targets
72     1.36%
200    1.36%
178    1.13%
71     1.13%
90     1.13%
       ...
136    0.23%
295    0.23%
79     0.23%
25     0.23%
195    0.23%
Name: count, Length: 214, dtype: object
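The outlier rule described above amounts to a comparison on the absolute error. A minimal sketch, with a hypothetical helper `flag_outliers` and made-up target values:

```python
def flag_outliers(y_true, y_pred, outlier_threshold):
    """True where the absolute error exceeds the threshold (hypothetical helper)."""
    return [abs(a - p) > outlier_threshold for a, p in zip(y_true, y_pred)]


# Toy actual and predicted values (assumption), with absolute errors 10, 60, 5, 80
y_actual_toy = [100, 150, 200, 250]
y_predicted_toy = [110, 90, 205, 330]

flags = flag_outliers(y_actual_toy, y_predicted_toy, outlier_threshold=50)
print(flags)
```

Points flagged this way are the ones that would be drawn in a different color on the scatterplot.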

Tree Visualization#

Now let’s train a decision tree on some data. We can visualize the structure of the Decision Tree that was fit to that data, and save it if necessary.

[17]:

pipeline_dt = BinaryClassificationPipeline(
    ["Simple Imputer", "Decision Tree Classifier"]
)
pipeline_dt.fit(X_train, y_train)

[17]:

pipeline = BinaryClassificationPipeline(component_graph={'Simple Imputer': ['Simple Imputer', 'X', 'y'], 'Decision Tree Classifier': ['Decision Tree Classifier', 'Simple Imputer.x', 'y']}, parameters={'Simple Imputer':{'impute_strategy': 'most_frequent', 'fill_value': None}, 'Decision Tree Classifier':{'criterion': 'gini', 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0}}, random_seed=0)

[18]:

from evalml.model_understanding.visualizations import visualize_decision_tree

visualize_decision_tree(
    pipeline_dt.estimator, max_depth=2, rotate=False, filled=True, filepath=None
)

[18]:

../_images/user_guide_model_understanding_35_0.svg

Confusion Matrix and Thresholds for Binary Classification Pipelines#

For binary classification pipelines, EvalML also provides the ability to compare the histograms of the actual positive and actual negative classes, and to obtain the confusion matrices and ideal thresholds per objective.

[19]:

from evalml.model_understanding import find_confusion_matrix_per_thresholds

df, objective_thresholds = find_confusion_matrix_per_thresholds(
    pipeline_binary, X, y, n_bins=10
)
df.head(10)

[19]:

	true_pos_count	true_neg_count	true_positives	true_negatives	false_positives	false_negatives	data_in_bins
0.1	1	309	211	309	48	1	[19, 20, 21, 37, 46]
0.2	0	35	211	344	13	1	[68, 92, 123, 133, 147]
0.3	0	5	211	349	8	1	[112, 157, 484, 491, 505]
0.4	0	3	211	352	5	1	[208, 340, 465]
0.5	0	0	211	352	5	1	[]
0.6	3	2	208	354	3	4	[40, 89, 128, 263, 297]
0.7	2	2	206	356	1	6	[13, 81, 385, 421]
0.8	9	1	197	357	0	15	[38, 41, 54, 73, 86]
0.9	15	0	182	357	0	30	[39, 44, 91, 99, 100]
1.0	182	0	0	357	0	212	[0, 1, 2, 3, 4]

[20]:

objective_thresholds

[20]:

{'accuracy': {'objective score': 0.9894551845342706, 'threshold value': 0.4},
 'balanced_accuracy': {'objective score': 0.9906387083135141,
  'threshold value': 0.4},
 'precision': {'objective score': 1.0, 'threshold value': 0.8},
 'f1': {'objective score': 0.9859813084112149, 'threshold value': 0.4}}

In the above results, the first dataframe contains the histograms for the actual positive and negative classes, indicated by true_pos_count and true_neg_count. The columns true_positives, true_negatives, false_positives, and false_negatives contain the confusion matrix information for the associated threshold, and the data_in_bins column holds a random subset of row indices (both positive and negative) that belong in each bin. The index of the dataframe represents the associated threshold. For instance, at index 0.1, there are 1 positive and 309 negative rows that fall between [0.0, 0.1].

The returned objective_thresholds dictionary has the objective measure as the key, and the dictionary value associated contains both the best objective score and the threshold that results in the associated score.
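The relationship between the per-bin histogram columns and the confusion matrix columns can be sketched in plain Python. This is an illustration under stated assumptions, not EvalML's implementation: the helper `confusion_per_threshold` is hypothetical, bins are treated as (lower, upper] with 0.0 included in the first bin, and rows whose probability falls at or below a bin's upper edge are counted as predicted negative at that threshold.

```python
def confusion_per_threshold(y_true, y_proba, n_bins=5):
    """Per-bin class counts plus cumulative confusion matrix counts at each bin edge."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    seen_pos = seen_neg = 0
    rows = {}
    for b in range(1, n_bins + 1):
        lower, upper = (b - 1) / n_bins, b / n_bins
        in_bin = [
            a for a, p in zip(y_true, y_proba)
            if lower < p <= upper or (b == 1 and p == 0.0)
        ]
        pos_count = sum(in_bin)
        neg_count = len(in_bin) - pos_count
        seen_pos += pos_count
        seen_neg += neg_count
        rows[upper] = {
            "true_pos_count": pos_count,
            "true_neg_count": neg_count,
            # Rows in bins at or below this edge are predicted negative
            "true_positives": n_pos - seen_pos,
            "true_negatives": seen_neg,
            "false_positives": n_neg - seen_neg,
            "false_negatives": seen_pos,
        }
    return rows


# Toy labels (1 = positive class) and predicted probabilities
y_true_toy = [0, 0, 0, 1, 0, 1, 1, 1]
y_proba_toy = [0.05, 0.15, 0.3, 0.35, 0.55, 0.7, 0.85, 0.95]

table = confusion_per_threshold(y_true_toy, y_proba_toy, n_bins=5)
for threshold, row in table.items():
    print(threshold, row)
```

The same bookkeeping is consistent with the first row of the output above: 1 positive and 309 negative rows fall in the first bin, giving 1 false negative and 309 true negatives at a threshold of 0.1.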

Visualize High-Dimensional Data in a Lower-Dimensional Space#

We can use t-SNE to visualize data with many features on a 2D plot, making it easier to see relationships in the data.

[21]:

# Our data is high-dimensional, so we can't plot it directly in a way we can interpret
print(len(X.columns))

[22]:

from evalml.model_understanding import graph_t_sne

fig = graph_t_sne(X)
fig

Partial Dependence Plots#

We can calculate the one-way partial dependence for a feature.

[23]:

from evalml.model_understanding import partial_dependence

partial_dependence(
    pipeline_binary, X_holdout, features="mean radius", grid_resolution=5
)

[23]:

	feature_values	partial_dependence	class_label
0	9.69092	0.392453	malignant
1	12.40459	0.395962	malignant
2	15.11826	0.417396	malignant
3	17.83193	0.429542	malignant
4	20.54560	0.429717	malignant
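To show what the partial_dependence column represents, here is a minimal sketch of one-way partial dependence computed by hand: for each grid value, the feature is overwritten in every row and the model's predictions are averaged. The helper `partial_dependence_1d`, the toy model `toy_predict_proba`, and the explicit grid are hypothetical simplifications. In practice the grid is usually derived from the observed range of the feature.

```python
def partial_dependence_1d(predict_proba, X, feature_index, grid):
    """Average prediction when one feature is fixed to each grid value in turn."""
    results = []
    for value in grid:
        # Overwrite the chosen feature in every row, keeping other features as observed
        X_mod = [
            row[:feature_index] + [value] + row[feature_index + 1:]
            for row in X
        ]
        preds = predict_proba(X_mod)
        results.append((value, sum(preds) / len(preds)))
    return results


def toy_predict_proba(X):
    """Toy model (assumption): probability rises with feature 0 and feature 1."""
    return [min(1.0, max(0.0, 0.1 * row[0] + 0.2 * row[1])) for row in X]


X_toy = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]

pd_values = partial_dependence_1d(toy_predict_proba, X_toy, 0, [0.0, 2.0, 4.0])
for value, avg_pred in pd_values:
    print(value, round(avg_pred, 3))
```

Because the other features keep their observed values, the result describes the average effect of the chosen feature on the prediction rather than the effect for any single row.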

[24]:

from evalml.model_understanding import graph_partial_dependence

graph_partial_dependence(
    pipeline_binary, X_holdout, features="mean radius", grid_resolution=5
)

We can also compute the partial dependence for a categorical feature. We will demonstrate this on the fraud dataset.

[25]:

X_fraud, y_fraud = evalml.demos.load_fraud(100, verbose=False)
X_fraud.ww.init(
    logical_types={
        "provider": "Categorical",
        "region": "Categorical",
        "currency": "Categorical",
        "expiration_date": "Categorical",
    }
)

fraud_pipeline = BinaryClassificationPipeline(
    ["DateTime Featurizer", "One Hot Encoder", "Random Forest Classifier"]
)
fraud_pipeline.fit(X_fraud, y_fraud)

graph_partial_dependence(fraud_pipeline, X_fraud, features="provider")