{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Components" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Components are the lowest level of building blocks in EvalML. Each component represents a fundamental operation to be applied to data.\n", "\n", "All components accept parameters as keyword arguments to their `__init__` methods. These parameters can be used to configure behavior.\n", "\n", "Each component class definition must include a human-readable `name` for the component. Additionally, each component class may expose parameters for AutoML search by defining a `hyperparameter_ranges` attribute containing the parameters in question.\n", "\n", "EvalML splits components into two categories: **transformers** and **estimators**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformers\n", "\n", "Transformers subclass the `Transformer` class, and define a `fit` method to learn information from training data and a `transform` method to apply a learned transformation to new data.\n", "\n", "For example, an [imputer](../autoapi/evalml/pipelines/components/index.rst#evalml.pipelines.components.SimpleImputer) is configured with the desired impute strategy to follow, for instance the mean value. The imputers `fit` method would learn the mean from the training data, and the `transform` method would fill the learned mean value in for any missing values in new data.\n", "\n", "All transformers can execute `fit` and `transform` separately or in one step by calling `fit_transform`. Defining a custom `fit_transform` method can facilitate useful performance optimizations in some cases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from evalml.pipelines.components import SimpleImputer\n", "\n", "X = pd.DataFrame([[1, 2, 3], [1, np.nan, 3]])\n", "display(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import woodwork as ww\n", "\n", "imp = SimpleImputer(impute_strategy=\"mean\")\n", "\n", "X.ww.init()\n", "X = imp.fit_transform(X)\n", "display(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a list of all transformers included with EvalML:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines.components.utils import all_components, Estimator, Transformer\n", "\n", "for component in all_components():\n", " if issubclass(component, Transformer):\n", " print(f\"Transformer: {component.name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimators\n", "\n", "Each estimator wraps an ML algorithm. Estimators subclass the `Estimator` class, and define a `fit` method to learn information from training data and a `predict` method for generating predictions from new data. 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a list of all transformers included with EvalML:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines.components.utils import all_components, Estimator, Transformer\n", "\n", "for component in all_components():\n", "    if issubclass(component, Transformer):\n", "        print(f\"Transformer: {component.name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimators\n", "\n", "Each estimator wraps an ML algorithm. Estimators subclass the `Estimator` class, and define a `fit` method to learn information from training data and a `predict` method for generating predictions from new data. Classification estimators should also define a `predict_proba` method for generating predicted probabilities.\n", "\n", "Estimator classes each define a `model_family` attribute indicating what type of model is used.\n", "\n", "Here's an example of using the [LogisticRegressionClassifier](../autoapi/evalml/pipelines/index.rst#evalml.pipelines.LogisticRegressionClassifier) estimator to fit and predict on a simple dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines.components import LogisticRegressionClassifier\n", "\n", "clf = LogisticRegressionClassifier()\n", "\n", "# X is the imputed feature matrix from the SimpleImputer example above\n", "y = [1, 0]\n", "\n", "clf.fit(X, y)\n", "clf.predict(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a list of all estimators included with EvalML:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines.components.utils import all_components, Estimator, Transformer\n", "\n", "for component in all_components():\n", "    if issubclass(component, Estimator):\n", "        print(f\"Estimator: {component.name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining Custom Components\n", "\n", "EvalML allows you to easily create your own custom components by following the steps below.\n", "\n", "### Custom Transformers\n", "\n", "Your transformer must inherit from the correct subclass; in this case, that's [Transformer](../autoapi/evalml/pipelines/index.rst#evalml.pipelines.Transformer), the base class for components that transform data. Below, we use EvalML's [DropNullColumns](../autoapi/evalml/pipelines/components/index.rst#evalml.pipelines.components.DropNullColumns) as an example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines.components import Transformer\n", "from evalml.utils import (\n", "    infer_feature_types,\n", ")\n", "\n", "\n", "class DropNullColumns(Transformer):\n", "    \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold.\"\"\"\n", "\n", "    name = \"Drop Null Columns Transformer\"\n", "    hyperparameter_ranges = {}\n", "\n", "    def __init__(self, pct_null_threshold=1.0, random_seed=0, **kwargs):\n", "        \"\"\"Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.\n", "\n", "        Args:\n", "            pct_null_threshold (float): The percentage of NaN values an input feature must contain to be dropped.\n", "                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.\n", "                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.\n", "        \"\"\"\n", "        if pct_null_threshold < 0 or pct_null_threshold > 1:\n", "            raise ValueError(\n", "                \"pct_null_threshold must be a float between 0 and 1, inclusive.\"\n", "            )\n", "        parameters = {\"pct_null_threshold\": pct_null_threshold}\n", "        parameters.update(kwargs)\n", "\n", "        self._cols_to_drop = None\n", "        super().__init__(\n", "            parameters=parameters, component_obj=None, random_seed=random_seed\n", "        )\n", "\n", "    def fit(self, X, y=None):\n", "        \"\"\"Fits DropNullColumns component to data.\n", "\n", "        Args:\n", "            X (pd.DataFrame): The input training data of shape [n_samples, n_features]\n", "            y (pd.Series, optional): The target training data of length [n_samples]\n", "\n", "        Returns:\n", "            self\n", "        \"\"\"\n", "        pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n", "        X_t = infer_feature_types(X)\n", "        percent_null = X_t.isnull().mean()\n", "        if pct_null_threshold == 0.0:\n", "            null_cols = percent_null[percent_null > 0]\n", "        else:\n", "            null_cols = percent_null[percent_null >= pct_null_threshold]\n", "        self._cols_to_drop = list(null_cols.index)\n", "        return self\n", "\n", "    def transform(self, X, y=None):\n", "        \"\"\"Transforms data X by dropping columns that exceed the threshold of null values.\n", "\n", "        Args:\n", "            X (pd.DataFrame): Data to transform\n", "            y (pd.Series, optional): Ignored.\n", "\n", "        Returns:\n", "            pd.DataFrame: Transformed X\n", "        \"\"\"\n", "        X_t = infer_feature_types(X)\n", "        # drop the columns identified during fit; .ww.drop removes columns\n", "        # and keeps the woodwork schema intact\n", "        return X_t.ww.drop(self._cols_to_drop)"
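] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, illustrative check of the transformer above (the frame and column names below are made up for this example), a column that is entirely null is dropped at the default threshold of 1.0, while a partially-null column is kept:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_nulls = pd.DataFrame(\n", "    {\"some_nulls\": [0.0, np.nan, 2.0], \"all_nulls\": [np.nan, np.nan, np.nan]}\n", ")\n", "drop_null = DropNullColumns(pct_null_threshold=1.0)\n", "display(drop_null.fit_transform(X_nulls))"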
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Required fields\n", "\n", "- `name`: A human-readable name.\n", "\n", "- `modifies_features`: A boolean that specifies whether this component modifies (subsets or transforms) the features variable during `transform`.\n", "\n", "- `modifies_target`: A boolean that specifies whether this component modifies (subsets or transforms) the target variable during `transform`.\n", "\n", "#### Required methods\n", "\n", "Likewise, there are select methods you need to override, as `Transformer` is an abstract base class:\n", "\n", "- `__init__()`: The `__init__()` method of your transformer will need to call `super().__init__()` and pass three parameters in: a `parameters` dictionary holding the parameters to the component, the `component_obj`, and the `random_seed` value. You can see that `component_obj` is set to `None` above; we will discuss `component_obj` in depth later on.\n", "\n", "- `fit()`: The `fit()` method is responsible for fitting your component on training data. It should return the component object.\n", "\n", "- `transform()`: After fitting a component, the `transform()` method will take in new data and transform it accordingly. It should return a pandas dataframe with woodwork initialized. Note: `fit()` must be called before `transform()`.\n", "\n", "You can also call or override `fit_transform()`, which combines `fit()` and `transform()` into a single method."
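] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting these pieces together, here is a minimal, illustrative skeleton of a custom transformer. The class below is invented for this example (it is not part of EvalML) and assumes all input features are numeric:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class DoubleColumnsTransformer(Transformer):\n", "    \"\"\"Illustrative transformer that doubles every value in the input.\"\"\"\n", "\n", "    name = \"Double Columns Transformer\"\n", "    hyperparameter_ranges = {}\n", "\n", "    def __init__(self, random_seed=0, **kwargs):\n", "        parameters = {}\n", "        parameters.update(kwargs)\n", "        super().__init__(\n", "            parameters=parameters, component_obj=None, random_seed=random_seed\n", "        )\n", "\n", "    def fit(self, X, y=None):\n", "        # this example is stateless, so there is nothing to learn\n", "        return self\n", "\n", "    def transform(self, X, y=None):\n", "        X_t = infer_feature_types(X)\n", "        # re-infer types so the returned dataframe has woodwork initialized\n", "        return infer_feature_types(X_t * 2)"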
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from evalml.model_family import ModelFamily\n", "from evalml.pipelines.components.estimators import Estimator\n", "from evalml.problem_types import ProblemTypes\n", "\n", "\n", "class BaselineRegressor(Estimator):\n", " \"\"\"Regressor that predicts using the specified strategy.\n", "\n", " This is useful as a simple baseline regressor to compare with other regressors.\n", " \"\"\"\n", "\n", " name = \"Baseline Regressor\"\n", " hyperparameter_ranges = {}\n", " model_family = ModelFamily.BASELINE\n", " supported_problem_types = [\n", " ProblemTypes.REGRESSION,\n", " ProblemTypes.TIME_SERIES_REGRESSION,\n", " ]\n", "\n", " def __init__(self, strategy=\"mean\", random_seed=0, **kwargs):\n", " \"\"\"Baseline regressor that uses a simple strategy to make predictions.\n", "\n", " Args:\n", " strategy (str): Method used to predict. Valid options are \"mean\", \"median\". Defaults to \"mean\".\n", " random_seed (int): Seed for the random number generator. Defaults to 0.\n", "\n", " \"\"\"\n", " if strategy not in [\"mean\", \"median\"]:\n", " raise ValueError(\n", " \"'strategy' parameter must equal either 'mean' or 'median'\"\n", " )\n", " parameters = {\"strategy\": strategy}\n", " parameters.update(kwargs)\n", "\n", " self._prediction_value = None\n", " self._num_features = None\n", " super().__init__(\n", " parameters=parameters, component_obj=None, random_seed=random_seed\n", " )\n", "\n", " def fit(self, X, y=None):\n", " if y is None:\n", " raise ValueError(\"Cannot fit Baseline regressor if y is None\")\n", " X = infer_feature_types(X)\n", " y = infer_feature_types(y)\n", "\n", " if self.parameters[\"strategy\"] == \"mean\":\n", " self._prediction_value = y.mean()\n", " elif self.parameters[\"strategy\"] == \"median\":\n", " self._prediction_value = y.median()\n", " self._num_features = X.shape[1]\n", " return self\n", "\n", " def predict(self, X):\n", " X = infer_feature_types(X)\n", " predictions = pd.Series([self._prediction_value] * len(X))\n", " return infer_feature_types(predictions)\n", "\n", " @property\n", " def feature_importance(self):\n", " \"\"\"Returns importance associated with each feature. 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Required fields\n", "\n", "- `name`: A human-readable name.\n", "\n", "- `model_family`: The EvalML [model_family](../autoapi/evalml/model_family/index.rst#evalml.model_family.ModelFamily) this component belongs to.\n", "\n", "- `supported_problem_types`: A list of the EvalML [problem_types](../autoapi/evalml/problem_types/index.rst#evalml.problem_types.ProblemTypes) this component supports.\n", "\n", "- `modifies_features`: A boolean that specifies whether the return value from `predict` or `predict_proba` should be used as features.\n", "\n", "- `modifies_target`: A boolean that specifies whether the return value from `predict` or `predict_proba` should be used as the target variable.\n", "\n", "Model families and problem types include:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.model_family import ModelFamily\n", "from evalml.problem_types import ProblemTypes\n", "\n", "print(\"Model Families:\\n\", [m.value for m in ModelFamily])\n", "print(\"Problem Types:\\n\", [p.value for p in ProblemTypes])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Required methods\n", "\n", "- `__init__()`: The `__init__()` method of your estimator will need to call `super().__init__()` and pass three parameters in: a `parameters` dictionary holding the parameters to the component, the `component_obj`, and the `random_seed` value.\n", "\n", "- `fit()`: The `fit()` method is responsible for fitting your component on training data.\n", "\n", "- `predict()`: After fitting a component, the `predict()` method will take in new data and predict new target values. Note: `fit()` must be called before `predict()`.\n", "\n", "- `feature_importance`: `feature_importance` is a [Python property](https://docs.python.org/3/library/functions.html#property) that returns a list of importances associated with each feature.\n", "\n", "If your estimator handles classification problems, it also requires an additional method:\n", "\n", "- `predict_proba()`: This method predicts probability estimates for classification labels."
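] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, the `LogisticRegressionClassifier` fit earlier in this notebook exposes `predict_proba`, returning one column of predicted probabilities per target class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# clf and X were fit and imputed in the Estimators section above\n", "clf.predict_proba(X)"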
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression as SKLinearRegression\n", "\n", "from evalml.model_family import ModelFamily\n", "from evalml.pipelines.components.estimators import Estimator\n", "from evalml.problem_types import ProblemTypes\n", "\n", "\n", "class LinearRegressor(Estimator):\n", " \"\"\"Linear Regressor.\"\"\"\n", "\n", " name = \"Linear Regressor\"\n", " model_family = ModelFamily.LINEAR_MODEL\n", " supported_problem_types = [ProblemTypes.REGRESSION]\n", "\n", " def __init__(\n", " self, fit_intercept=True, normalize=False, n_jobs=-1, random_seed=0, **kwargs\n", " ):\n", " parameters = {\n", " \"fit_intercept\": fit_intercept,\n", " \"normalize\": normalize,\n", " \"n_jobs\": n_jobs,\n", " }\n", " parameters.update(kwargs)\n", " linear_regressor = SKLinearRegression(**parameters)\n", " super().__init__(\n", " parameters=parameters,\n", " component_obj=linear_regressor,\n", " random_seed=random_seed,\n", " )\n", "\n", " @property\n", " def feature_importance(self):\n", " return self._component_obj.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hyperparameter Ranges for AutoML\n", "`hyperparameter_ranges` is a dictionary mapping the parameter name (str) to an allowed range ([SkOpt Space](https://scikit-optimize.github.io/stable/modules/classes.html#module-skopt.space.space)) for that parameter. Both lists and `skopt.space.Categorical` values are accepted for categorical spaces. \n", "\n", "AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component's `hyperparameter_ranges` class attribute. Any component parameter you add an entry for in `hyperparameter_ranges` will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Component Code\n", "\n", "Once you have a component defined in EvalML, you can generate string Python code to recreate this component, which can then be saved and run elsewhere with EvalML. `generate_component_code` requires a component instance as the input. This method works for custom components as well, although it won't return the code required to define the custom component. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from evalml.pipelines.components import LogisticRegressionClassifier\n", "from evalml.pipelines.components.utils import generate_component_code\n", "\n", "lr = LogisticRegressionClassifier(C=5)\n", "code = generate_component_code(lr)\n", "print(code)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this string can then be copy and pasted into a separate window and executed as python code\n", "exec(code)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can also do this for custom components\n", "from evalml.pipelines.components.utils import generate_component_code\n", "\n", "myDropNull = DropNullColumns()\n", "print(generate_component_code(myDropNull))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expectations for Custom Classification Components\n", "EvalML expects the following from custom classification component implementations:\n", "\n", "- Classification targets will range from 0 to n-1 and are integers.\n", "- For classification estimators, the order of predict_proba's columns must match the order of the target, and the column names must be integers ranging from 0 to n-1" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }