graphs

Model understanding graphing utilities.

Module Contents

Functions

binary_objective_vs_threshold

Computes objective score as a function of potential binary classification decision thresholds for a fitted binary classification pipeline.

decision_tree_data_from_estimator

Return data for a fitted tree in a restructured format.

decision_tree_data_from_pipeline

Return data for a fitted pipeline in a restructured format.

get_linear_coefficients

Returns a dataframe showing the features with the greatest predictive power for a linear model.

get_prediction_vs_actual_data

Combines y_true and y_pred into a single dataframe and adds a column for outliers. Used in graph_prediction_vs_actual().

get_prediction_vs_actual_over_time_data

Get the data needed for the prediction_vs_actual_over_time plot.

graph_binary_objective_vs_threshold

Generates a plot graphing objective score vs. decision thresholds for a fitted binary classification pipeline.

graph_permutation_importance

Generate a bar graph of the pipeline's permutation importance.

graph_prediction_vs_actual

Generate a scatter plot comparing the true and predicted values. Used for regression plotting.

graph_prediction_vs_actual_over_time

Plot the target values and predictions against time on the x-axis.

graph_t_sne

Plot high dimensional data into lower dimensional space using t-SNE.

t_sne

Get the transformed output after fitting X to the embedded space using t-SNE.

visualize_decision_tree

Generate an image visualizing the decision tree.

Contents

evalml.model_understanding.graphs.binary_objective_vs_threshold(pipeline, X, y, objective, steps=100)

Computes objective score as a function of potential binary classification decision thresholds for a fitted binary classification pipeline.

Parameters
  • pipeline (BinaryClassificationPipeline obj) – Fitted binary classification pipeline.

  • X (pd.DataFrame) – The input data used to compute objective score.

  • y (pd.Series) – The target labels.

  • objective (ObjectiveBase obj, str) – Objective used to score.

  • steps (int) – Number of intervals to divide and calculate objective score at.

Returns

DataFrame with thresholds and the corresponding objective score calculated at each threshold.

Return type

pd.DataFrame

Raises
  • ValueError – If objective is not a binary classification objective.

  • ValueError – If objective’s score_needs_proba is not False.
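The threshold sweep described above can be sketched in plain Python. This is an illustration only, not evalml's implementation: it assumes the thresholds are evenly spaced over [0, 1] at steps + 1 points and that a row is labeled positive when its predicted probability exceeds the threshold. The names sweep_thresholds and accuracy are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels (stand-in objective)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sweep_thresholds(
    probabilities: Sequence[float],
    y_true: Sequence[int],
    objective: Callable[[Sequence[int], Sequence[int]], float],
    steps: int = 100,
) -> List[Tuple[float, float]]:
    """Score the objective at steps + 1 evenly spaced thresholds in [0, 1]."""
    results = []
    for i in range(steps + 1):
        threshold = i / steps
        # Assumption: a probability strictly above the threshold means "positive".
        y_pred = [int(p > threshold) for p in probabilities]
        results.append((threshold, objective(y_true, y_pred)))
    return results


rows = sweep_thresholds([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1], accuracy, steps=4)
for threshold, score in rows:
    print(f"{threshold:.2f} -> {score:.2f}")
```

Each (threshold, score) pair corresponds to one row of the returned dataframe; the real function takes a fitted pipeline and an evalml objective rather than raw probabilities.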

evalml.model_understanding.graphs.decision_tree_data_from_estimator(estimator)

Return data for a fitted tree in a restructured format.

Parameters

estimator (ComponentBase) – A fitted DecisionTree-based estimator.

Returns

An OrderedDict of OrderedDicts describing a tree structure.

Return type

OrderedDict

Raises
  • ValueError – If estimator is not a decision tree-based estimator.

  • NotFittedError – If estimator is not yet fitted.

evalml.model_understanding.graphs.decision_tree_data_from_pipeline(pipeline_)

Return data for a fitted pipeline in a restructured format.

Parameters

pipeline_ (PipelineBase) – A fitted pipeline with a DecisionTree-based estimator.

Returns

An OrderedDict of OrderedDicts describing a tree structure.

Return type

OrderedDict

Raises
  • ValueError – If estimator is not a decision tree-based estimator.

  • NotFittedError – If estimator is not yet fitted.
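Both decision tree functions above return nested OrderedDicts. This page does not list the key names, so the sketch below invents them (Feature, Threshold, Left_Child, Right_Child, Value) purely for illustration; inspect the actual output before relying on any key. It shows one way such a nested structure could be traversed, here to count leaves.

```python
from collections import OrderedDict

# Hypothetical layout: internal nodes hold a split and two children,
# leaves hold only a value. These key names are assumptions.
tree = OrderedDict(
    [
        ("Feature", "age"),
        ("Threshold", 30.5),
        ("Left_Child", OrderedDict([("Value", 0.0)])),
        (
            "Right_Child",
            OrderedDict(
                [
                    ("Feature", "income"),
                    ("Threshold", 50000.0),
                    ("Left_Child", OrderedDict([("Value", 1.0)])),
                    ("Right_Child", OrderedDict([("Value", 0.0)])),
                ]
            ),
        ),
    ]
)


def count_leaves(node: OrderedDict) -> int:
    """Count nodes that have no children under the assumed key names."""
    children = [node[key] for key in ("Left_Child", "Right_Child") if key in node]
    if not children:
        return 1
    return sum(count_leaves(child) for child in children)


print(count_leaves(tree))
```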

evalml.model_understanding.graphs.get_linear_coefficients(estimator, features=None)

Returns a dataframe showing the features with the greatest predictive power for a linear model.

Parameters
  • estimator (Estimator) – Fitted linear model family estimator.

  • features (list[str]) – List of feature names associated with the underlying data.

Returns

A dataframe displaying the features by importance.

Return type

pd.DataFrame

Raises
  • ValueError – If the model is not a linear model.

  • NotFittedError – If the model is not yet fitted.

evalml.model_understanding.graphs.get_prediction_vs_actual_data(y_true, y_pred, outlier_threshold=None)

Combines y_true and y_pred into a single dataframe and adds a column for outliers. Used in graph_prediction_vs_actual().

Parameters
  • y_true (pd.Series or np.ndarray) – The real target values of the data.

  • y_pred (pd.Series or np.ndarray) – The predicted values outputted by the regression model.

  • outlier_threshold (int, float) – A positive threshold for what is considered an outlier value. This value is compared to the absolute difference between each value of y_true and y_pred. Values within this threshold are colored blue; all others are colored yellow. Defaults to None.

Returns

A dataframe with the following columns:

  • prediction: Predicted values from regression model.

  • actual: Real target values.

  • outlier: Colors indicating which values are in the threshold for what is considered an outlier value.

Return type

pd.DataFrame

Raises

ValueError – If outlier_threshold is not positive.
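The outlier column can be illustrated with a short stdlib-only sketch. It is not evalml's code: the function name outlier_colors is hypothetical, plain color names stand in for whatever color values the library emits, and the handling of a difference exactly equal to the threshold and of outlier_threshold=None are assumptions.

```python
from typing import List, Optional, Sequence


def outlier_colors(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    outlier_threshold: Optional[float] = None,
) -> List[str]:
    """Return one color per row: "blue" inside the threshold, "yellow" outside."""
    if outlier_threshold is not None and outlier_threshold <= 0:
        raise ValueError("outlier_threshold must be positive")
    if outlier_threshold is None:
        # Assumption: with no threshold, no row is flagged as an outlier.
        return ["blue"] * len(y_true)
    return [
        # Assumption: a difference equal to the threshold counts as "within".
        "blue" if abs(actual - predicted) <= outlier_threshold else "yellow"
        for actual, predicted in zip(y_true, y_pred)
    ]


colors = outlier_colors([10.0, 20.0, 30.0], [11.0, 28.0, 29.5], outlier_threshold=2)
print(colors)
```

The absolute differences here are 1, 8, and 0.5, so only the second row exceeds the threshold of 2.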

evalml.model_understanding.graphs.get_prediction_vs_actual_over_time_data(pipeline, X, y, X_train, y_train, dates)

Get the data needed for the prediction_vs_actual_over_time plot.

Parameters
  • pipeline (TimeSeriesRegressionPipeline) – Fitted time series regression pipeline.

  • X (pd.DataFrame) – Features used to generate new predictions.

  • y (pd.Series) – Target values to compare predictions against.

  • X_train (pd.DataFrame) – Data the pipeline was trained on.

  • y_train (pd.Series) – Target values for training data.

  • dates (pd.Series) – Dates corresponding to target values and predictions.

Returns

Data for plotting predictions against time.

Return type

pd.DataFrame

evalml.model_understanding.graphs.graph_binary_objective_vs_threshold(pipeline, X, y, objective, steps=100)

Generates a plot graphing objective score vs. decision thresholds for a fitted binary classification pipeline.

Parameters
  • pipeline (PipelineBase or subclass) – Fitted pipeline.

  • X (pd.DataFrame) – The input data used to compute objective scores.

  • y (pd.Series) – The target labels.

  • objective (ObjectiveBase obj, str) – Objective used to score, shown on the y-axis of the graph.

  • steps (int) – Number of intervals to divide and calculate objective score at.

Returns

A plotly.Figure representing the objective score vs. threshold graph.

evalml.model_understanding.graphs.graph_permutation_importance(pipeline, X, y, objective, importance_threshold=0)

Generate a bar graph of the pipeline’s permutation importance.

Parameters
  • pipeline (PipelineBase or subclass) – Fitted pipeline.

  • X (pd.DataFrame) – The input data used to score and compute permutation importance.

  • y (pd.Series) – The target data.

  • objective (str, ObjectiveBase) – Objective to score on.

  • importance_threshold (float, optional) – If provided, graph features with a permutation importance whose absolute value is larger than importance_threshold. Defaults to 0.

Returns

plotly.Figure, a bar graph showing features and their respective permutation importance.

Raises

ValueError – If importance_threshold is not greater than or equal to 0.
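The effect of importance_threshold can be sketched without the library. The helper below is hypothetical and works on a plain dict of feature importances; the strict "larger than" comparison follows the wording of the parameter description and is an assumption about the actual behavior.

```python
from typing import Dict


def filter_by_importance(
    importances: Dict[str, float], importance_threshold: float = 0
) -> Dict[str, float]:
    """Keep features whose absolute importance exceeds the threshold."""
    if importance_threshold < 0:
        raise ValueError("importance_threshold must be greater than or equal to 0")
    return {
        name: value
        for name, value in importances.items()
        # Absolute value: strongly negative importances are kept as well.
        if abs(value) > importance_threshold
    }


scores = {"age": 0.12, "income": -0.30, "zip_code": 0.01}
kept = filter_by_importance(scores, importance_threshold=0.05)
print(kept)
```

Comparing on the absolute value means a feature whose permutation makes the score noticeably better or noticeably worse stays in the graph, while near-zero features are dropped.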

evalml.model_understanding.graphs.graph_prediction_vs_actual(y_true, y_pred, outlier_threshold=None)

Generate a scatter plot comparing the true and predicted values. Used for regression plotting.

Parameters
  • y_true (pd.Series) – The real target values of the data.

  • y_pred (pd.Series) – The predicted values outputted by the regression model.

  • outlier_threshold (int, float) – A positive threshold for what is considered an outlier value. This value is compared to the absolute difference between each value of y_true and y_pred. Values within this threshold are colored blue; all others are colored yellow. Defaults to None.

Returns

A plotly.Figure representing the predicted vs. actual values graph.

Raises

ValueError – If outlier_threshold is not positive.

evalml.model_understanding.graphs.graph_prediction_vs_actual_over_time(pipeline, X, y, X_train, y_train, dates)

Plot the target values and predictions against time on the x-axis.

Parameters
  • pipeline (TimeSeriesRegressionPipeline) – Fitted time series regression pipeline.

  • X (pd.DataFrame) – Features used to generate new predictions.

  • y (pd.Series) – Target values to compare predictions against.

  • X_train (pd.DataFrame) – Data the pipeline was trained on.

  • y_train (pd.Series) – Target values for training data.

  • dates (pd.Series) – Dates corresponding to target values and predictions.

Returns

A figure showing the predicted vs. actual values over time.

Return type

plotly.Figure

Raises

ValueError – If the pipeline is not a time-series regression pipeline.

evalml.model_understanding.graphs.graph_t_sne(X, n_components=2, perplexity=30.0, learning_rate=200.0, metric='euclidean', marker_line_width=2, marker_size=7, **kwargs)

Plot high dimensional data into lower dimensional space using t-SNE.

Parameters
  • X (np.ndarray, pd.DataFrame) – Data to be transformed. Must be numeric.

  • n_components (int) – Dimension of the embedded space. Defaults to 2.

  • perplexity (float) – Related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Defaults to 30.

  • learning_rate (float) – Usually in the range [10.0, 1000.0]. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help. Must be positive. Defaults to 200.

  • metric (str) – The metric to use when calculating distance between instances in a feature array. The default is “euclidean” which is interpreted as the squared euclidean distance.

  • marker_line_width (int) – Determines the line width of the marker boundary. Defaults to 2.

  • marker_size (int) – Determines the size of the marker. Defaults to 7.

  • kwargs – Arbitrary keyword arguments.

Returns

Figure representing the transformed data.

Return type

plotly.Figure

Raises

ValueError – If marker_line_width or marker_size are not valid values.

evalml.model_understanding.graphs.t_sne(X, n_components=2, perplexity=30.0, learning_rate=200.0, metric='euclidean', **kwargs)

Get the transformed output after fitting X to the embedded space using t-SNE.

Parameters
  • X (np.ndarray, pd.DataFrame) – Data to be transformed. Must be numeric.

  • n_components (int, optional) – Dimension of the embedded space.

  • perplexity (float, optional) – Related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50.

  • learning_rate (float, optional) – Usually in the range [10.0, 1000.0]. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help.

  • metric (str, optional) – The metric to use when calculating distance between instances in a feature array.

  • kwargs – Arbitrary keyword arguments.

Returns

TSNE output.

Return type

np.ndarray (n_samples, n_components)

Raises

ValueError – If specified parameters are not valid values.

evalml.model_understanding.graphs.visualize_decision_tree(estimator, max_depth=None, rotate=False, filled=False, filepath=None)

Generate an image visualizing the decision tree.

Parameters
  • estimator (ComponentBase) – A fitted DecisionTree-based estimator.

  • max_depth (int, optional) – The depth to which the tree should be displayed. If set to None (as by default), tree is fully generated.

  • rotate (bool, optional) – Orient tree left to right rather than top-down.

  • filled (bool, optional) – Paint nodes to indicate majority class for classification, extremity of values for regression, or purity of node for multi-output.

  • filepath (str, optional) – Path to where the graph should be saved. If set to None (as by default), the graph will not be saved.

Returns

DOT object that can be directly displayed in Jupyter notebooks.

Return type

graphviz.Source

Raises
  • ValueError – If estimator is not a decision tree-based estimator.

  • NotFittedError – If estimator is not yet fitted.