Component Graphs#
EvalML component graphs represent and describe the flow of data in a collection of related components. A component graph consists of nodes representing components, and edges between pairs of nodes representing where the inputs and outputs of each component should go. It is the backbone of the features offered by the EvalML pipeline, but is also a powerful data structure on its own. EvalML currently supports component graphs that are either linear or, more generally, directed acyclic graphs (DAGs).
Defining a Component Graph#
Component graphs can be defined by specifying the dictionary of components and edges that describe the graph.
In this dictionary, each key is a reference name for a component. Each corresponding value is a list, where the first element is the component itself, and the remaining elements are the input edges that should be connected to that component. The component as listed in the value can either be the component object itself or its string name.
This structure is very similar to that of Dask computation graphs.
For example, consider a simple component graph made up of two components: an Imputer and a Random Forest Classifier. The names used to reference these two components are given by the keys, “My Imputer” and “RF Classifier” respectively. Each value in the dictionary is a list where the first element is the component corresponding to the component name, and the remaining elements are the inputs, e.g. “My Imputer” represents an Imputer component which has inputs “X” (the original features matrix) and “y” (the original target).
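The two-component graph described above can be written as a plain dictionary. The sketch below only builds and inspects that dictionary, so it runs without EvalML installed; the component names are given as strings, which the text above says is allowed.

```python
# Dictionary form of the two-component graph:
# key = reference name, value = [component, *input edges].
component_dict = {
    "My Imputer": ["Imputer", "X", "y"],
    "RF Classifier": ["Random Forest Classifier", "My Imputer.x", "y"],
}

# Unpack each value into the component and its input edges.
for name, (component, *inputs) in component_dict.items():
    print(f"{name!r} is a {component!r} with inputs {inputs}")
```

Here “RF Classifier” consumes the feature output of “My Imputer” (`"My Imputer.x"`) together with the original target `"y"`.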
Feature edges are specified as "X" or "{component_name}.x". For example, {"My Component": [MyComponent, "Imputer.x", ...]} indicates that we should use the feature output of the Imputer as part of the feature input for MyComponent. Similarly, target edges are specified as "y" or "{component_name}.y". For example, {"My Component": [MyComponent, "Target Imputer.y", ...]} indicates that we should use the target output of the Target Imputer as the target input for MyComponent.
Each component can have a number of feature inputs, but can only have one target input. All input edges must be explicitly defined.
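The edge naming rules above can be checked mechanically. The following is a minimal sketch, not EvalML's own validation code: `split_edges` and `check_component_inputs` are hypothetical helper names introduced here, and the only rule enforced is the one stated above (at most one target edge per component).

```python
def split_edges(inputs):
    """Split a list of input edges into (feature_edges, target_edges)."""
    features = [e for e in inputs if e == "X" or e.endswith(".x")]
    targets = [e for e in inputs if e == "y" or e.endswith(".y")]
    unknown = [e for e in inputs if e not in features and e not in targets]
    if unknown:
        raise ValueError(f"Unrecognized edges: {unknown}")
    return features, targets


def check_component_inputs(component_dict):
    """Raise ValueError if any node lists more than one target edge."""
    for name, (_component, *inputs) in component_dict.items():
        _features, targets = split_edges(inputs)
        if len(targets) > 1:
            raise ValueError(
                f"{name!r} has {len(targets)} target edges; only one is allowed"
            )


# Component names are plain strings here so the sketch runs without EvalML.
ok = {"My Component": ["MyComponent", "Imputer.x", "Target Imputer.y"]}
bad = {"My Component": ["MyComponent", "Imputer.x", "y", "Target Imputer.y"]}

check_component_inputs(ok)  # passes silently
try:
    check_component_inputs(bad)
except ValueError as err:
    print(err)
```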
Using a real example, we define a simple component graph consisting of three nodes: an Imputer (“My Imputer”), a One-Hot Encoder (“OHE”), and a Random Forest Classifier (“RF Classifier”).
“My Imputer” takes the original X as a features input, and the original y as the target input.
“OHE” also takes the original X as a features input, and the original y as the target input.
“RF Classifier” takes the concatenated feature outputs from “My Imputer” and “OHE” as a features input, and the original y as the target input.
[1]:
from evalml.pipelines import ComponentGraph
component_dict = {
    "My Imputer": ["Imputer", "X", "y"],
    "OHE": ["One Hot Encoder", "X", "y"],
    "RF Classifier": [
        "Random Forest Classifier",
        "My Imputer.x",
        "OHE.x",
        "y",
    ],  # takes in multiple feature inputs
}
cg_simple = ComponentGraph(component_dict)
All component graphs must end with exactly one final, or terminus, node. This can be either a transformer or an estimator. Below, the component graph is invalid because it has two terminus nodes: the “RF Classifier” and the “EN Classifier”.
[2]:
# Can't instantiate a component graph with more than one terminus node (here: RF Classifier, EN Classifier)
component_dict = {
    "My Imputer": ["Imputer", "X", "y"],
    "RF Classifier": ["Random Forest Classifier", "My Imputer.x", "y"],
    "EN Classifier": ["Elastic Net Classifier", "My Imputer.x", "y"],
}
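To see why the dictionary above has two terminus nodes, it helps to list the nodes whose outputs no other node consumes. The sketch below does this in plain Python; `terminus_nodes` is a hypothetical helper written for illustration and is not part of EvalML.

```python
def terminus_nodes(component_dict):
    """Return the names of nodes whose outputs are not used by any other node."""
    consumed = set()
    for _component, *inputs in component_dict.values():
        for edge in inputs:
            if edge not in ("X", "y"):
                # "My Imputer.x" -> "My Imputer"
                consumed.add(edge.rsplit(".", 1)[0])
    return [name for name in component_dict if name not in consumed]


valid = {
    "My Imputer": ["Imputer", "X", "y"],
    "OHE": ["One Hot Encoder", "X", "y"],
    "RF Classifier": ["Random Forest Classifier", "My Imputer.x", "OHE.x", "y"],
}
invalid = {
    "My Imputer": ["Imputer", "X", "y"],
    "RF Classifier": ["Random Forest Classifier", "My Imputer.x", "y"],
    "EN Classifier": ["Elastic Net Classifier", "My Imputer.x", "y"],
}

print(terminus_nodes(valid))    # ['RF Classifier']
print(terminus_nodes(invalid))  # ['RF Classifier', 'EN Classifier']
```

A graph is acceptable under the rule above only when this list has exactly one entry.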
Once we have defined a component graph, we can instantiate the graph with specific parameter values for each component using .instantiate(parameters). All components in a component graph must be instantiated before fitting, transforming, or predicting.
Below, we instantiate our graph and set the value of our Imputer’s numeric_impute_strategy to “most_frequent”.
[3]:
cg_simple.instantiate({"My Imputer": {"numeric_impute_strategy": "most_frequent"}})
[3]:
{'My Imputer': ['Imputer', 'X', 'y'], 'OHE': ['One Hot Encoder', 'X', 'y'], 'RF Classifier': ['Random Forest Classifier', 'My Imputer.x', 'OHE.x', 'y']}
Components in the Component Graph#
You can use .get_component(name) and provide the unique component name to access any component in the component graph. Below, we can grab our Imputer component and confirm that numeric_impute_strategy has indeed been set to “most_frequent”.
[4]:
cg_simple.get_component("My Imputer")
[4]:
Imputer(categorical_impute_strategy='most_frequent', numeric_impute_strategy='most_frequent', boolean_impute_strategy='most_frequent', categorical_fill_value=None, numeric_fill_value=None, boolean_fill_value=None)
You can also use .get_inputs(name) and provide the unique component name to retrieve all inputs for that component.
Below, we retrieve the inputs of our “RF Classifier” component and confirm that it uses "My Imputer.x" and "OHE.x" as its features inputs and "y" as its target input.
[5]:
cg_simple.get_inputs("RF Classifier")
[5]:
['My Imputer.x', 'OHE.x', 'y']
Component Graph Computation Order#
Upon initialization, each component graph will generate a topological order. We can access this generated order through the .compute_order attribute. This attribute is used to determine the order in which components should be evaluated during calls to fit and transform.
[6]:
cg_simple.compute_order
[6]:
['My Imputer', 'OHE', 'RF Classifier']
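A topological order simply guarantees that every component appears after all of the components it takes inputs from. As a rough illustration, the sketch below derives such an order from a component dictionary using the standard-library `graphlib` module (Python 3.9+). This is an assumption-laden stand-in, not how EvalML computes `compute_order` internally, and `sketch_compute_order` is a hypothetical name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sketch_compute_order(component_dict):
    """Return one valid evaluation order for the nodes of a component dict."""
    dependencies = {}
    for name, (_component, *inputs) in component_dict.items():
        # "My Imputer.x" -> depends on node "My Imputer"; "X"/"y" are raw inputs.
        dependencies[name] = {
            edge.rsplit(".", 1)[0] for edge in inputs if edge not in ("X", "y")
        }
    # static_order() yields each node only after all of its dependencies.
    return list(TopologicalSorter(dependencies).static_order())


component_dict = {
    "My Imputer": ["Imputer", "X", "y"],
    "OHE": ["One Hot Encoder", "X", "y"],
    "RF Classifier": ["Random Forest Classifier", "My Imputer.x", "OHE.x", "y"],
}
order = sketch_compute_order(component_dict)
print(order)
```

Whatever order the two independent transformers come out in, “RF Classifier” is always last because it depends on both of them. A graph containing a cycle would make `TopologicalSorter` raise `graphlib.CycleError`, which is why only acyclic graphs can be ordered this way.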
Visualizing Component Graphs#
We can get more information about an instantiated component graph by calling .describe(). This method will pretty-print each of the components in the graph and its parameters.
[7]:
# Using a more involved component graph with more complex edges
component_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "Target Imputer": ["Target Imputer", "X", "y"],
    "OneHot_RandomForest": ["One Hot Encoder", "Imputer.x", "Target Imputer.y"],
    "OneHot_ElasticNet": ["One Hot Encoder", "Imputer.x", "y"],
    "Random Forest": ["Random Forest Classifier", "OneHot_RandomForest.x", "y"],
    "Elastic Net": [
        "Elastic Net Classifier",
        "OneHot_ElasticNet.x",
        "Target Imputer.y",
    ],
    "Logistic Regression": [
        "Logistic Regression Classifier",
        "Random Forest.x",
        "Elastic Net.x",
        "y",
    ],
}
cg_with_estimators = ComponentGraph(component_dict)
cg_with_estimators.instantiate({})
cg_with_estimators.describe()
1. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* boolean_impute_strategy : most_frequent
* categorical_fill_value : None
* numeric_fill_value : None
* boolean_fill_value : None
2. Target Imputer
* impute_strategy : most_frequent
* fill_value : None
3. One Hot Encoder
* top_n : 10
* features_to_encode : None
* categories : None
* drop : if_binary
* handle_unknown : ignore
* handle_missing : error
4. One Hot Encoder
* top_n : 10
* features_to_encode : None
* categories : None
* drop : if_binary
* handle_unknown : ignore
* handle_missing : error
5. Random Forest Classifier
* n_estimators : 100
* max_depth : 6
* n_jobs : -1
6. Elastic Net Classifier
* penalty : elasticnet
* C : 1.0
* l1_ratio : 0.15
* n_jobs : -1
* multi_class : auto
* solver : saga
7. Logistic Regression Classifier
* penalty : l2
* C : 1.0
* n_jobs : -1
* multi_class : auto
* solver : lbfgs
We can also visualize a component graph by calling .graph().
[8]:
cg_with_estimators.graph()
Component Graph Methods#
Similar to the pipeline structure, we can call fit, transform, or predict.
We can also call fit_features, which will fit all but the final component, and compute_final_component_features, which will transform all but the final component. These two methods may be useful in cases where you want to understand what transformed features are being passed into the last component.
[9]:
from evalml.demos import load_breast_cancer
X, y = load_breast_cancer()
component_dict = {
    "My Imputer": ["Imputer", "X", "y"],
    "OHE": ["One Hot Encoder", "My Imputer.x", "y"],
}
cg_with_final_transformer = ComponentGraph(component_dict)
cg_with_final_transformer.instantiate({})
cg_with_final_transformer.fit(X, y)
# We can call `transform` for ComponentGraphs with a final transformer
cg_with_final_transformer.transform(X, y)
Number of Features
Numeric 30
Number of training examples: 569
Targets
benign 62.74%
malignant 37.26%
Name: count, dtype: object
[9]:
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |
568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | 0.05884 | ... | 9.456 | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 |
569 rows × 30 columns
[10]:
cg_with_estimators.fit(X, y)
# We can call `predict` for ComponentGraphs with a final estimator
cg_with_estimators.predict(X)
[10]:
0 malignant
1 malignant
2 malignant
3 malignant
4 malignant
...
564 malignant
565 malignant
566 malignant
567 malignant
568 benign
Length: 569, dtype: category
Categories (2, object): ['benign', 'malignant']
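The idea behind fit_features and compute_final_component_features, as described earlier in this section, is to run everything except the last node so that you can inspect what the final component would receive. The toy sketch below mimics that idea for a simple linear chain of components. Every class and function in it (`Center`, `Double`, `FinalEstimator`, `fit_all_but_final`, `transform_all_but_final`) is hypothetical and written only for illustration; none of it is EvalML code, and real component graphs may be DAGs rather than chains.

```python
class Center:
    """Toy transformer: learns the mean in fit, subtracts it in transform."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class Double:
    """Toy stateless transformer: doubles every value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [2 * x for x in X]


class FinalEstimator:
    """Stand-in for the last node; the helpers below never touch it."""

    fitted = False

    def fit(self, X, y=None):
        self.fitted = True
        return self


def fit_all_but_final(components, X, y):
    """Fit every component except the last, feeding transformed features forward."""
    for component in components[:-1]:
        X = component.fit(X, y).transform(X)
    return X


def transform_all_but_final(components, X):
    """Transform X through every already-fitted component except the last."""
    for component in components[:-1]:
        X = component.transform(X)
    return X


chain = [Center(), Double(), FinalEstimator()]
fit_all_but_final(chain, [1.0, 2.0, 3.0], y=None)   # Center learns mean 2.0
features = transform_all_but_final(chain, [4.0])    # (4.0 - 2.0) * 2
print(features)           # [4.0]
print(chain[-1].fitted)   # False: the final component was left untouched
```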