partial_dependence_functions
=================================================================
.. py:module:: evalml.model_understanding.partial_dependence_functions
.. autoapi-nested-parse::
Top level functions for running partial dependence.
Module Contents
---------------
Functions
~~~~~~~~~
.. autoapisummary::
:nosignatures:
evalml.model_understanding.partial_dependence_functions.graph_partial_dependence
evalml.model_understanding.partial_dependence_functions.partial_dependence
Contents
~~~~~~~~~~~~~~~~~~~
.. py:function:: graph_partial_dependence(pipeline, X, features, class_label=None, grid_resolution=100, kind='average')
Create a one-way or two-way partial dependence plot.
Passing a single integer or string as features creates a one-way partial dependence plot, with the feature values
plotted against the partial dependence. Passing a tuple of integers/strings as features creates
a two-way partial dependence plot: a contour with features[0] on the y-axis, features[1]
on the x-axis, and the partial dependence on the z-axis.
:param pipeline: Fitted pipeline.
:type pipeline: PipelineBase or subclass
:param X: The input data used to generate the grid of feature values
at which partial dependence is calculated.
:type X: pd.DataFrame, np.ndarray
:param features: The target feature(s) for which to create the partial dependence plot.
If features is an int, it must be the index of the feature to use.
If features is a string, it must be a valid column name in X.
If features is a tuple of ints/strings, it must contain valid column indices/names in X.
:type features: int, string, tuple[int or string]
:param class_label: Name of class to plot for multiclass problems. If None, will plot
the partial dependence for each class. This argument does not change behavior for regression or binary
classification pipelines. For binary classification, the partial dependence for the positive label will
always be displayed. Defaults to None.
:type class_label: string, optional
:param grid_resolution: Number of samples of feature(s) for partial dependence plot. Defaults to 100.
:type grid_resolution: int
:param kind: Type of partial dependence to plot. 'average' creates a regular partial dependence
(PD) graph, 'individual' creates an individual conditional expectation (ICE) plot, and 'both' creates a
single-figure PD and ICE plot. ICE plots can only be shown for one-way partial dependence plots.
Defaults to 'average'.
:type kind: {'average', 'individual', 'both'}
:returns: Figure object containing the partial dependence data for plotting.
:rtype: plotly.graph_objects.Figure
:raises PartialDependenceError: if a graph is requested for a class name that isn't present in the pipeline.
:raises PartialDependenceError: if an ICE plot is requested for a two-way partial dependence.
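The relationship between ``features`` and ``kind`` described above can be sketched in plain Python. This is only an illustration of the documented rules, not EvalML's code: ``plot_mode`` is a hypothetical helper, and it raises ``ValueError`` as a stand-in for ``PartialDependenceError``.

```python
def plot_mode(features, kind="average"):
    """Classify a features argument as 'one-way' or 'two-way'.

    Hypothetical helper: mirrors the documented rules only.
    ValueError stands in for PartialDependenceError.
    """
    # A single int (column index) or string (column name) means one-way.
    if isinstance(features, (int, str)):
        return "one-way"
    # A tuple of exactly two ints/strings means two-way.
    if isinstance(features, tuple) and len(features) == 2:
        # ICE curves ('individual' or 'both') are documented as one-way only.
        if kind in ("individual", "both"):
            raise ValueError(
                "ICE plots are only available for one-way partial dependence"
            )
        return "two-way"
    raise ValueError(
        "features must be an int, a string, or a tuple of two ints/strings"
    )


print(plot_mode("age"))                     # single column name
print(plot_mode(("age", 3)))                # column name plus column index
print(plot_mode("age", kind="individual"))  # ICE on a single feature
```

Requesting ``kind='individual'`` or ``kind='both'`` together with a two-element tuple is the one combination this sketch rejects.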
.. py:function:: partial_dependence(pipeline, X, features, percentiles=(0.05, 0.95), grid_resolution=100, kind='average', fast_mode=False, X_train=None, y_train=None)
Calculates one-way or two-way partial dependence.
If a single integer or string is given for features, one-way partial dependence is calculated. If
a tuple of two integers or strings is given, two-way partial dependence
is calculated with the first feature on the y-axis and the second feature on the x-axis.
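As a rough illustration of what a one-way calculation involves, the sketch below builds a grid between two percentiles of a feature and evaluates a toy model at each grid value. It is a simplified stand-in, not EvalML's implementation: ``percentile``, ``make_grid``, ``one_way_partial_dependence``, and ``toy_predict`` are hypothetical names, and the evenly spaced grid is an assumption about how numeric grids are typically built.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated quantile q in [0, 1] of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def make_grid(values, percentiles=(0.05, 0.95), grid_resolution=100):
    """Evenly spaced grid between the lower and upper percentile of values.

    Assumes grid_resolution >= 2.
    """
    ordered = sorted(values)
    low = percentile(ordered, percentiles[0])
    high = percentile(ordered, percentiles[1])
    step = (high - low) / (grid_resolution - 1)
    return [low + i * step for i in range(grid_resolution)]


def one_way_partial_dependence(predict, rows, feature, grid):
    """Return (individual, average) predictions over the grid.

    individual[i][j] is the prediction for sample i with `feature`
    overwritten by grid[j]; average[j] is the mean over all samples.
    """
    individual = [
        [predict({**row, feature: value}) for value in grid]
        for row in rows
    ]
    average = [sum(column) / len(rows) for column in zip(*individual)]
    return individual, average


# Toy stand-in for a fitted pipeline's prediction function (hypothetical).
def toy_predict(row):
    return 2.0 * row["x1"] + row["x2"]


rows = [{"x1": float(i), "x2": float(i % 3)} for i in range(21)]
grid = make_grid([row["x1"] for row in rows], grid_resolution=5)
individual, average = one_way_partial_dependence(toy_predict, rows, "x1", grid)

for value, pd_value in zip(grid, average):
    print(f"x1={value:.2f} -> average prediction {pd_value:.2f}")
```

The ``individual`` result corresponds to the idea behind ``kind='individual'`` (one curve per sample), and ``average`` to ``kind='average'`` (those curves averaged over all samples).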
:param pipeline: Fitted pipeline
:type pipeline: PipelineBase or subclass
:param X: The input data used to generate the grid of feature values
at which partial dependence is calculated.
:type X: pd.DataFrame, np.ndarray
:param features: The target feature(s) for which to calculate partial dependence.
If features is an int, it must be the index of the feature to use.
If features is a string, it must be a valid column name in X.
If features is a tuple of ints/strings, it must contain valid column indices/names in X.
:type features: int, string, tuple[int or string]
:param percentiles: The lower and upper percentiles used to create the extreme values for the grid.
Each must be in [0, 1]. Defaults to (0.05, 0.95).
:type percentiles: tuple[float]
:param grid_resolution: Number of samples of feature(s) for partial dependence plot. If this value
is less than the maximum number of categories present in categorical data within X, it will be
set to the max number of categories + 1. Defaults to 100.
:type grid_resolution: int
:param kind: The type of predictions to return. 'individual' returns the predictions at every
grid point for each sample in X. 'average' returns the predictions at every grid point,
averaged over all samples in X. 'both' returns the averaged and the individual predictions together.
Defaults to 'average'.
:type kind: {'average', 'individual', 'both'}
:param fast_mode: Whether or not performance optimizations should be
used for partial dependence calculations. Defaults to False.
Note that user-specified components may not produce correct partial dependence results, so fast mode
should only be used with EvalML-native components. Additionally, some components are not compatible
with fast mode; in those cases, an error will be raised indicating that fast mode should not be used.
:type fast_mode: bool, optional
:param X_train: The data that was used to train the original pipeline. Will
be used in fast mode to train the cloned pipelines. Defaults to None.
:type X_train: pd.DataFrame, np.ndarray
:param y_train: The target data that was used to train the original pipeline. Will
be used in fast mode to train the cloned pipelines. Defaults to None.
:type y_train: pd.Series, np.ndarray
:returns: When ``kind='average'``: DataFrame with the predictions at each grid point, averaged
over all samples of X, along with the values used to calculate those predictions.
When ``kind='individual'``: DataFrame with the individual predictions at each grid point for each sample
of X, along with the values used to calculate those predictions. If two-way partial dependence is calculated,
the result is a list of DataFrames, each representing one sample's predictions.
When ``kind='both'``: A tuple consisting of the averaged predictions over all samples of X (a DataFrame) and the individual
predictions for each sample of X (a list of DataFrames).
In the one-way case: The DataFrame contains two columns, "feature_values" (the grid points at which the
partial dependence was calculated) and "partial_dependence" (the partial dependence at that feature value).
For classification problems, there is a third column called "class_label" (the class label for which
the partial dependence was calculated). For binary classification, the partial dependence is only calculated
for the "positive" class.
In the two-way case: The DataFrame contains grid_resolution rows and columns, where the
index and column headers are the sampled values of the first and second features, respectively, used to make
the partial dependence contour. The values of the DataFrame are the partial dependence for each
feature value pair.
:rtype: pd.DataFrame, list(pd.DataFrame), or tuple(pd.DataFrame, list(pd.DataFrame))
:raises ValueError: Error during call to scikit-learn's partial dependence method.
:raises Exception: All other errors during calculation.
:raises PartialDependenceError: if the user provides a tuple of not exactly two features.
:raises PartialDependenceError: if the provided pipeline isn't fitted.
:raises PartialDependenceError: if the provided pipeline is a Baseline pipeline.
:raises PartialDependenceError: if any of the features passed in are completely NaN.
:raises PartialDependenceError: if any of the features are low-variance, defined as having a single value whose share
of the observations exceeds the upper percentile passed by the user (95% by default).
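The low-variance condition can be read as a check on how often a feature's most common value occurs. The sketch below follows that reading. It is an interpretation of the description above rather than EvalML's own check, and ``is_low_variance`` is a hypothetical helper.

```python
from collections import Counter


def is_low_variance(values, percentiles=(0.05, 0.95)):
    """True when the most common value's share exceeds the upper percentile.

    Hypothetical helper illustrating one reading of the documented rule.
    Missing entries are represented here as None and are ignored.
    """
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        # The all-missing case is documented as a separate error condition.
        return False
    _, top_count = Counter(non_missing).most_common(1)[0]
    return top_count / len(non_missing) > percentiles[1]


mostly_constant = [1] * 97 + [2, 3, 4]  # the value 1 makes up 97% of the data
varied = list(range(100))               # every value appears exactly once

print(is_low_variance(mostly_constant))
print(is_low_variance(varied))
print(is_low_variance(mostly_constant, percentiles=(0.01, 0.99)))
```

Under this reading, raising the upper percentile makes the check less strict, because a dominant value must then cover a larger share of the data before the feature is flagged.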