Components#
Components are the lowest level of building blocks in EvalML. Each component represents a fundamental operation to be applied to data.
All components accept parameters as keyword arguments to their __init__ methods. These parameters can be used to configure behavior.
Each component class definition must include a human-readable name for the component. Additionally, each component class may expose parameters for AutoML search by defining a hyperparameter_ranges attribute containing the parameters in question.
EvalML splits components into two categories: transformers and estimators.
Transformers#
Transformers subclass the Transformer class, and define a fit method to learn information from training data and a transform method to apply a learned transformation to new data.
For example, an imputer is configured with the desired impute strategy to follow, for instance the mean value. The imputer's fit method would learn the mean from the training data, and the transform method would fill the learned mean value in for any missing values in new data.
All transformers can execute fit and transform separately or in one step by calling fit_transform. Defining a custom fit_transform method can facilitate useful performance optimizations in some cases.
[1]:
import numpy as np
import pandas as pd
from evalml.pipelines.components import SimpleImputer
X = pd.DataFrame([[1, 2, 3], [1, np.nan, 3]])
display(X)
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2.0 | 3 |
| 1 | 1 | NaN | 3 |
[2]:
import woodwork as ww
imp = SimpleImputer(impute_strategy="mean")
X.ww.init()
X = imp.fit_transform(X)
display(X)
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2.0 | 3 |
| 1 | 1 | 2.0 | 3 |
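The fit / transform split described above can also be sketched in plain Python. This is only an illustration of the contract, assuming a hypothetical `MeanImputerSketch` class that works on a flat list of numbers; it is not EvalML's `SimpleImputer` and does not use the `Transformer` base class.

```python
import math


class MeanImputerSketch:
    """Hypothetical stand-in for a transformer; not an EvalML class."""

    def __init__(self, impute_strategy="mean"):
        # Parameters arrive as keyword arguments and are stored for later use.
        self.parameters = {"impute_strategy": impute_strategy}
        self._fill_value = None

    def fit(self, values):
        # Learn the statistic from the non-missing training values only.
        observed = [v for v in values if not math.isnan(v)]
        self._fill_value = sum(observed) / len(observed)
        return self

    def transform(self, values):
        # Apply the learned value to new data; nothing is re-learned here.
        return [self._fill_value if math.isnan(v) else v for v in values]

    def fit_transform(self, values):
        # Simplest possible behavior: fit, then transform the same data.
        return self.fit(values).transform(values)


train = [1.0, float("nan"), 3.0]
new_data = [float("nan"), 10.0]

imputer = MeanImputerSketch()
train_filled = imputer.fit_transform(train)
new_filled = imputer.transform(new_data)
print(train_filled)
print(new_filled)
```

The value used to fill `new_data` comes from the training data, not from `new_data` itself, which is the point of keeping fit and transform separate.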
Below is a list of all transformers included with EvalML:
[3]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Transformer):
        print(f"Transformer: {component.name}")
Transformer: Time Series Regularizer
Transformer: Drop NaN Rows Transformer
Transformer: Replace Nullable Types Transformer
Transformer: Drop Rows Transformer
Transformer: URL Featurizer
Transformer: Email Featurizer
Transformer: Log Transformer
Transformer: STL Decomposer
Transformer: Polynomial Decomposer
Transformer: DFS Transformer
Transformer: Time Series Featurizer
Transformer: Natural Language Featurizer
Transformer: LSA Transformer
Transformer: Drop Null Columns Transformer
Transformer: DateTime Featurizer
Transformer: PCA Transformer
Transformer: Linear Discriminant Analysis Transformer
Transformer: Select Columns By Type Transformer
Transformer: Select Columns Transformer
Transformer: Drop Columns Transformer
Transformer: Oversampler
Transformer: Undersampler
Transformer: Standard Scaler
Transformer: Time Series Imputer
Transformer: Target Imputer
Transformer: Imputer
Transformer: KNN Imputer
Transformer: Per Column Imputer
Transformer: Simple Imputer
Transformer: RFE Selector with RF Regressor
Transformer: RFE Selector with RF Classifier
Transformer: RF Regressor Select From Model
Transformer: RF Classifier Select From Model
Transformer: Ordinal Encoder
Transformer: Label Encoder
Transformer: Target Encoder
Transformer: One Hot Encoder
Estimators#
Each estimator wraps an ML algorithm. Estimators subclass the Estimator class, and define a fit method to learn information from training data and a predict method for generating predictions from new data. Classification estimators should also define a predict_proba method for generating predicted probabilities.
Estimator classes each define a model_family attribute indicating what type of model is used.
Here’s an example of using the LogisticRegressionClassifier estimator to fit and predict on a simple dataset:
[4]:
from evalml.pipelines.components import LogisticRegressionClassifier
clf = LogisticRegressionClassifier()
# X is the imputed DataFrame from the transformer example above
y = [1, 0]
clf.fit(X, y)
clf.predict(X)
[4]:
0 0
1 0
dtype: int64
Below is a list of all estimators included with EvalML:
[5]:
from evalml.pipelines.components.utils import all_components, Estimator, Transformer

for component in all_components():
    if issubclass(component, Estimator):
        print(f"Estimator: {component.name}")
Estimator: Stacked Ensemble Regressor
Estimator: Stacked Ensemble Classifier
Estimator: Vowpal Wabbit Regressor
Estimator: VARMAX Regressor
Estimator: ARIMA Regressor
Estimator: Exponential Smoothing Regressor
Estimator: SVM Regressor
Estimator: Prophet Regressor
Estimator: Multiseries Time Series Baseline Regressor
Estimator: Time Series Baseline Estimator
Estimator: Decision Tree Regressor
Estimator: Baseline Regressor
Estimator: Extra Trees Regressor
Estimator: XGBoost Regressor
Estimator: CatBoost Regressor
Estimator: Random Forest Regressor
Estimator: LightGBM Regressor
Estimator: Linear Regressor
Estimator: Elastic Net Regressor
Estimator: Vowpal Wabbit Multiclass Classifier
Estimator: Vowpal Wabbit Binary Classifier
Estimator: SVM Classifier
Estimator: KNN Classifier
Estimator: Decision Tree Classifier
Estimator: LightGBM Classifier
Estimator: Baseline Classifier
Estimator: Extra Trees Classifier
Estimator: Elastic Net Classifier
Estimator: CatBoost Classifier
Estimator: XGBoost Classifier
Estimator: Random Forest Classifier
Estimator: Logistic Regression Classifier
Defining Custom Components#
EvalML allows you to easily create your own custom components by following the steps below.
Custom Transformers#
Your transformer must inherit from the correct base class: in this case, Transformer, for components that transform data. We will use EvalML’s DropNullColumns as an example.
[6]:
from evalml.pipelines.components import Transformer
from evalml.utils import (
    infer_feature_types,
)


class DropNullColumns(Transformer):
    """Transformer to drop features whose percentage of NaN values exceeds a specified threshold"""

    name = "Drop Null Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, pct_null_threshold=1.0, random_seed=0, **kwargs):
        """Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.

        Args:
            pct_null_threshold (float): The percentage of NaN values in an input feature to drop.
                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.
                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.
            random_seed (int): Seed for the random number generator. Defaults to 0.
        """
        if pct_null_threshold < 0 or pct_null_threshold > 1:
            raise ValueError(
                "pct_null_threshold must be a float between 0 and 1, inclusive."
            )
        parameters = {"pct_null_threshold": pct_null_threshold}
        parameters.update(kwargs)

        self._cols_to_drop = None
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )

    def fit(self, X, y=None):
        """Fits DropNullColumns component to data

        Args:
            X (pd.DataFrame): The input training data of shape [n_samples, n_features]
            y (pd.Series, optional): The target training data of length [n_samples]

        Returns:
            self
        """
        pct_null_threshold = self.parameters["pct_null_threshold"]
        X_t = infer_feature_types(X)
        # Fraction of null values in each column
        percent_null = X_t.isnull().mean()
        if pct_null_threshold == 0.0:
            null_cols = percent_null[percent_null > 0]
        else:
            null_cols = percent_null[percent_null >= pct_null_threshold]
        self._cols_to_drop = list(null_cols.index)
        return self

    def transform(self, X, y=None):
        """Transforms data X by dropping columns that exceed the threshold of null values.

        Args:
            X (pd.DataFrame): Data to transform
            y (pd.Series, optional): Ignored.

        Returns:
            pd.DataFrame: Transformed X
        """
        X_t = infer_feature_types(X)
        # Drop columns (not rows) through the woodwork accessor so typing information is kept
        return X_t.ww.drop(self._cols_to_drop)
Required fields#
name: A human-readable name.
modifies_features: A boolean that specifies whether this component modifies (subsets or transforms) the features variable during transform.
modifies_target: A boolean that specifies whether this component modifies (subsets or transforms) the target variable during transform.
Required methods#
Likewise, there are select methods you need to override, as Transformer is an abstract base class:
__init__(): The __init__() method of your transformer will need to call super().__init__() and pass three parameters in: a parameters dictionary holding the parameters to the component, the component_obj, and the random_seed value. You can see that component_obj is set to None above and we will discuss component_obj in depth later on.
fit(): The fit() method is responsible for fitting your component on training data. It should return the component object.
transform(): After fitting a component, the transform() method will take in new data and transform accordingly. It should return a pandas dataframe with woodwork initialized. Note: a component must call fit() before transform().
You can also call or override fit_transform(), which combines fit() and transform() into one method.
Custom Estimators#
Your estimator must inherit from the correct base class: in this case, Estimator, for components that predict new target values. We will use EvalML’s BaselineRegressor as an example.
[7]:
import numpy as np
import pandas as pd

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes
from evalml.utils import infer_feature_types


class BaselineRegressor(Estimator):
    """Regressor that predicts using the specified strategy.

    This is useful as a simple baseline regressor to compare with other regressors.
    """

    name = "Baseline Regressor"
    hyperparameter_ranges = {}
    model_family = ModelFamily.BASELINE
    supported_problem_types = [
        ProblemTypes.REGRESSION,
        ProblemTypes.TIME_SERIES_REGRESSION,
    ]

    def __init__(self, strategy="mean", random_seed=0, **kwargs):
        """Baseline regressor that uses a simple strategy to make predictions.

        Args:
            strategy (str): Method used to predict. Valid options are "mean", "median". Defaults to "mean".
            random_seed (int): Seed for the random number generator. Defaults to 0.
        """
        if strategy not in ["mean", "median"]:
            raise ValueError(
                "'strategy' parameter must equal either 'mean' or 'median'"
            )
        parameters = {"strategy": strategy}
        parameters.update(kwargs)

        self._prediction_value = None
        self._num_features = None
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )

    def fit(self, X, y=None):
        if y is None:
            raise ValueError("Cannot fit Baseline regressor if y is None")
        X = infer_feature_types(X)
        y = infer_feature_types(y)

        # Store the single value that will be predicted for every row
        if self.parameters["strategy"] == "mean":
            self._prediction_value = y.mean()
        elif self.parameters["strategy"] == "median":
            self._prediction_value = y.median()
        self._num_features = X.shape[1]
        return self

    def predict(self, X):
        X = infer_feature_types(X)
        predictions = pd.Series([self._prediction_value] * len(X))
        return infer_feature_types(predictions)

    @property
    def feature_importance(self):
        """Returns importance associated with each feature. Since baseline regressors do not use input features to calculate predictions, returns an array of zeroes.

        Returns:
            np.ndarray (float): An array of zeroes
        """
        return np.zeros(self._num_features)
Required fields#
name: A human-readable name.
model_family: The EvalML model_family that this component belongs to.
supported_problem_types: A list of EvalML problem_types that this component supports.
modifies_features: A boolean that specifies whether the return value from predict or predict_proba should be used as features.
modifies_target: A boolean that specifies whether the return value from predict or predict_proba should be used as the target variable.
Model families and problem types include:
[8]:
from evalml.model_family import ModelFamily
from evalml.problem_types import ProblemTypes
print("Model Families:\n", [m.value for m in ModelFamily])
print("Problem Types:\n", [p.value for p in ProblemTypes])
Model Families:
['k_neighbors', 'random_forest', 'svm', 'xgboost', 'lightgbm', 'linear_model', 'catboost', 'extra_trees', 'ensemble', 'decision_tree', 'exponential_smoothing', 'arima', 'varmax', 'baseline', 'prophet', 'vowpal_wabbit', 'none']
Problem Types:
['binary', 'multiclass', 'regression', 'time series regression', 'time series binary', 'time series multiclass', 'multiseries time series regression']
Required methods#
__init__(): The __init__() method of your estimator will need to call super().__init__() and pass three parameters in: a parameters dictionary holding the parameters to the component, the component_obj, and the random_seed value.
fit(): The fit() method is responsible for fitting your component on training data.
predict(): After fitting a component, the predict() method will take in new data and predict new target values. Note: a component must call fit() before predict().
feature_importance: feature_importance is a Python property that returns a list of importances associated with each feature.
If your estimator handles classification problems, it also requires an additional method:
predict_proba(): This method predicts probability estimates for classification labels.
Components Wrapping Third-Party Objects#
The component_obj parameter is used for wrapping third-party objects and using them in component implementation. If you’re using a component_obj, you will need to define __init__() and pass in the relevant object that has also implemented the required methods mentioned above. However, if the component_obj does not follow EvalML component conventions, you may need to override methods as needed. Below is an example of EvalML’s LinearRegressor.
[9]:
from sklearn.linear_model import LinearRegression as SKLinearRegression

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes


class LinearRegressor(Estimator):
    """Linear Regressor."""

    name = "Linear Regressor"
    model_family = ModelFamily.LINEAR_MODEL
    supported_problem_types = [ProblemTypes.REGRESSION]

    # Note: `normalize` was removed from scikit-learn's LinearRegression in
    # version 1.2; drop that parameter when running against newer releases.
    def __init__(
        self, fit_intercept=True, normalize=False, n_jobs=-1, random_seed=0, **kwargs
    ):
        parameters = {
            "fit_intercept": fit_intercept,
            "normalize": normalize,
            "n_jobs": n_jobs,
        }
        parameters.update(kwargs)
        # The wrapped third-party object is built from the same parameters
        linear_regressor = SKLinearRegression(**parameters)
        super().__init__(
            parameters=parameters,
            component_obj=linear_regressor,
            random_seed=random_seed,
        )

    @property
    def feature_importance(self):
        return self._component_obj.coef_
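The delegation idea behind component_obj can be sketched without EvalML or scikit-learn. The classes below are hypothetical stand-ins: `ThirdPartyMeanModel` plays the role of the third-party object and `WrapperSketch` plays the role of a component. The sketch assumes the wrapper simply forwards fit and predict to the wrapped object; EvalML's actual base classes also handle details such as type inference, which are omitted here.

```python
class ThirdPartyMeanModel:
    """Hypothetical third-party model exposing fit/predict."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class WrapperSketch:
    """Hypothetical component-like wrapper; not an EvalML class."""

    def __init__(self, parameters, component_obj, random_seed=0):
        self.parameters = parameters
        self._component_obj = component_obj
        self.random_seed = random_seed

    def fit(self, X, y):
        # Delegate to the wrapped object, then return the wrapper itself.
        self._component_obj.fit(X, y)
        return self

    def predict(self, X):
        # Delegate prediction to the wrapped object as well.
        return self._component_obj.predict(X)


wrapper = WrapperSketch(parameters={}, component_obj=ThirdPartyMeanModel())
predictions = wrapper.fit([[0], [1], [2]], [1.0, 2.0, 3.0]).predict([[5], [6]])
print(predictions)
```

When the wrapped object already exposes the expected methods, the wrapper needs little more than an `__init__`; methods only need overriding where the third-party interface differs.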
Hyperparameter Ranges for AutoML#
hyperparameter_ranges is a dictionary mapping the parameter name (str) to an allowed range (SkOpt Space) for that parameter. Both lists and skopt.space.Categorical values are accepted for categorical spaces.
AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component’s hyperparameter_ranges class attribute. Any component parameter you add an entry for in hyperparameter_ranges will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines.
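The role of the dictionary can be illustrated with a small plain-Python sketch. It assumes list-valued (categorical) ranges only and uses random sampling as a stand-in for the search; the parameter names and the `propose_parameters` helper are hypothetical, and EvalML's own tuners are not implemented this way.

```python
import random

# Hypothetical component defaults and search ranges (names are illustrative).
default_parameters = {"strategy": "mean", "n_neighbors": 5}
hyperparameter_ranges = {"strategy": ["mean", "median", "most_frequent"]}


def propose_parameters(defaults, ranges, rng):
    """Draw one candidate: searched parameters are sampled, the rest keep defaults."""
    proposal = dict(defaults)
    for name, choices in ranges.items():
        proposal[name] = rng.choice(choices)
    return proposal


rng = random.Random(0)
proposals = [
    propose_parameters(default_parameters, hyperparameter_ranges, rng)
    for _ in range(5)
]
for proposal in proposals:
    print(proposal)
```

Only `strategy` varies between proposals; `n_neighbors` has no entry in the ranges, so every candidate keeps its default value.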
Generate Component Code#
Once you have a component defined in EvalML, you can generate string Python code to recreate this component, which can then be saved and run elsewhere with EvalML. generate_component_code requires a component instance as the input. This method works for custom components as well, although it won’t return the code required to define the custom component.
[10]:
from evalml.pipelines.components import LogisticRegressionClassifier
from evalml.pipelines.components.utils import generate_component_code
lr = LogisticRegressionClassifier(C=5)
code = generate_component_code(lr)
print(code)
from evalml.pipelines.components.estimators.classifiers.logistic_regression_classifier import LogisticRegressionClassifier
logisticRegressionClassifier = LogisticRegressionClassifier(**{'penalty': 'l2', 'C': 5, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'})
[11]:
# this string can then be copied and pasted into a separate window and executed as Python code
exec(code)
[12]:
# We can also do this for custom components
from evalml.pipelines.components.utils import generate_component_code
myDropNull = DropNullColumns()
print(generate_component_code(myDropNull))
dropNullColumnsTransformer = DropNullColumns(**{'pct_null_threshold': 1.0})
Expectations for Custom Classification Components#
EvalML expects the following from custom classification component implementations:
Classification targets will range from 0 to n-1 and are integers.
For classification estimators, the order of predict_proba’s columns must match the order of the target, and the column names must be integers ranging from 0 to n-1.
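A minimal sketch of these two expectations, in plain Python. The helper names `encode_target` and `check_proba_columns` are hypothetical and not part of EvalML, and sorting the labels is just one possible way to fix a class order.

```python
def encode_target(y):
    """Map arbitrary class labels to integers 0..n-1 (sorted label order)."""
    classes = sorted(set(y))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in y], classes


def check_proba_columns(column_names, n_classes):
    """Return True if predict_proba column names are exactly 0..n-1, in order."""
    return list(column_names) == list(range(n_classes))


y_raw = ["spam", "ham", "spam", "eggs"]
y_encoded, classes = encode_target(y_raw)
print(y_encoded, classes)

# Column names as they might appear on a predict_proba result
print(check_proba_columns([0, 1, 2], len(classes)))
print(check_proba_columns([2, 0, 1], len(classes)))
```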