{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Component Graphs\n", "\n", "EvalML component graphs represent and describe the flow of data in a collection of related components. A component graph consists of nodes representing components, and edges between pairs of nodes representing where the inputs and outputs of each component should go. It is the backbone of the features offered by the EvalML [pipeline](pipelines.ipynb), but is also a powerful data structure on its own. EvalML currently supports component graphs as linear and [directed acyclic graphs (DAG)](https://en.wikipedia.org/wiki/Directed_acyclic_graph)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining a Component Graph\n", "\n", "Component graphs can be defined by specifying the dictionary of components and edges that describe the graph.\n", "\n", "In this dictionary, each key is a reference name for a component. Each corresponding value is a list, where the first element is the component itself, and the remaining elements are the input edges that should be connected to that component. The component as listed in the value can either be the component object itself or its string name.\n", "\n", "This structure is very similar to that of [Dask computation graphs](https://docs.dask.org/en/latest/spec.html).\n", "\n", "\n", "For example, consider a simple component graph made up of two components: an Imputer and a Random Forest Classifier. The names used to reference these two components are given by the keys, \"My Imputer\" and \"RF Classifier\" respectively. Each value in the dictionary is a list where the first element is the component corresponding to the component name, and the remaining elements are the inputs, e.g. \"My Imputer\" represents an Imputer component which has inputs \"X\" (the original features matrix) and \"y\" (the original target).\n", "\n", "Feature edges are specified as `\"X\"` or `\"{component_name}.x\"`. 
For example, `{\"My Component\": [MyComponent, \"Imputer.x\", ...]}` indicates that we should use the feature output of the `Imputer` as part of the feature input for MyComponent. Similarly, target edges are specified as `\"y\"` or `\"{component_name}.y\"`. For example, `{\"My Component\": [MyComponent, \"Target Imputer.y\", ...]}` indicates that we should use the target output of the `Target Imputer` as a target input for MyComponent.\n", "\n", "Each component can have a number of feature inputs, but can only have one target input. All input edges must be explicitly defined." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a real example, we define a simple component graph consisting of three nodes: an Imputer (\"My Imputer\"), a One-Hot Encoder (\"OHE\"), and a Random Forest Classifier (\"RF Classifier\"). \n", "\n", "- \"My Imputer\" takes the original X as a features input, and the original y as the target input\n", "- \"OHE\" also takes the original X as a features input, and the original y as the target input\n", "- \"RF Classifier\" takes the concatenated feature outputs from \"My Imputer\" and \"OHE\" as a features input, and the original y as the target input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines import ComponentGraph\n", "\n", "component_dict = {\n", "    \"My Imputer\": [\"Imputer\", \"X\", \"y\"],\n", "    \"OHE\": [\"One Hot Encoder\", \"X\", \"y\"],\n", "    \"RF Classifier\": [\n", "        \"Random Forest Classifier\",\n", "        \"My Imputer.x\",\n", "        \"OHE.x\",\n", "        \"y\",\n", "    ],  # takes in multiple feature inputs\n", "}\n", "cg_simple = ComponentGraph(component_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All component graphs must end with one final or terminus node. This can either be a transformer or an estimator. Below, the component graph is invalid because it has two terminus nodes: the \"RF Classifier\" and the \"EN Classifier\"." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Can't instantiate a component graph with more than one terminus node (here: RF Classifier, EN Classifier)\n", "component_dict = {\n", "    \"My Imputer\": [\"Imputer\", \"X\", \"y\"],\n", "    \"RF Classifier\": [\"Random Forest Classifier\", \"My Imputer.x\", \"y\"],\n", "    \"EN Classifier\": [\"Elastic Net Classifier\", \"My Imputer.x\", \"y\"],\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have defined a component graph, we can instantiate the graph with specific parameter values for each component using `.instantiate(parameters)`. All components in a component graph must be instantiated before fitting, transforming, or predicting.\n", "\n", "Below, we instantiate our graph and set the value of our Imputer's `numeric_impute_strategy` to \"most_frequent\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cg_simple.instantiate({\"My Imputer\": {\"numeric_impute_strategy\": \"most_frequent\"}})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Components in the Component Graph\n", "\n", "You can use `.get_component(name)` and provide the unique component name to access any component in the component graph. Below, we can grab our Imputer component and confirm that `numeric_impute_strategy` has indeed been set to \"most_frequent\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cg_simple.get_component(\"My Imputer\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use `.get_inputs(name)` and provide the unique component name to retrieve all inputs for that component.\n", "\n", "Below, we can grab our \"RF Classifier\" component and confirm that we use `\"My Imputer.x\"` and `\"OHE.x\"` as our features inputs and `\"y\"` as the target input." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cg_simple.get_inputs(\"RF Classifier\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Component Graph Computation Order\n", "\n", "Upon initialization, each component graph will generate a topological order. We can access this generated order through the `.compute_order` attribute. This attribute is used to determine the order in which components should be evaluated during calls to `fit` and `transform`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cg_simple.compute_order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing Component Graphs\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get more information about an instantiated component graph by calling `.describe()`. This method will pretty-print each of the components in the graph and its parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using a more involved component graph with more complex edges\n", "component_dict = {\n", "    \"Imputer\": [\"Imputer\", \"X\", \"y\"],\n", "    \"Target Imputer\": [\"Target Imputer\", \"X\", \"y\"],\n", "    \"OneHot_RandomForest\": [\"One Hot Encoder\", \"Imputer.x\", \"Target Imputer.y\"],\n", "    \"OneHot_ElasticNet\": [\"One Hot Encoder\", \"Imputer.x\", \"y\"],\n", "    \"Random Forest\": [\"Random Forest Classifier\", \"OneHot_RandomForest.x\", \"y\"],\n", "    \"Elastic Net\": [\n", "        \"Elastic Net Classifier\",\n", "        \"OneHot_ElasticNet.x\",\n", "        \"Target Imputer.y\",\n", "    ],\n", "    \"Logistic Regression\": [\n", "        \"Logistic Regression Classifier\",\n", "        \"Random Forest.x\",\n", "        \"Elastic Net.x\",\n", "        \"y\",\n", "    ],\n", "}\n", "cg_with_estimators = ComponentGraph(component_dict)\n", "cg_with_estimators.instantiate({})\n", "cg_with_estimators.describe()" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "We can also visualize a component graph by calling `.graph()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cg_with_estimators.graph()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Component graph methods\n", "\n", "Similar to the pipeline structure, we can call `fit`, `transform` or `predict`. \n", "\n", "We can also call `fit_features`, which will fit all but the final component, and `compute_final_component_features`, which will transform all but the final component. These two methods may be useful in cases where you want to understand what transformed features are being passed into the last component." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.demos import load_breast_cancer\n", "\n", "X, y = load_breast_cancer()\n", "component_dict = {\n", "    \"My Imputer\": [\"Imputer\", \"X\", \"y\"],\n", "    \"OHE\": [\"One Hot Encoder\", \"My Imputer.x\", \"y\"],\n", "}\n", "cg_with_final_transformer = ComponentGraph(component_dict)\n", "cg_with_final_transformer.instantiate({})\n", "cg_with_final_transformer.fit(X, y)\n", "\n", "# We can call `transform` for ComponentGraphs with a final transformer\n", "cg_with_final_transformer.transform(X, y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cg_with_estimators.fit(X, y)\n", "\n", "# We can call `predict` for ComponentGraphs with a final estimator\n", "cg_with_estimators.predict(X)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }