utils#
Utility methods for EvalML components.
Module Contents#
Classes Summary#
WrappedSKClassifier – Scikit-learn classifier wrapper class.
WrappedSKRegressor – Scikit-learn regressor wrapper class.
Functions#
all_components – Get all available components.
allowed_model_families – List the model types allowed for a particular problem type.
drop_natural_language_columns – Drops natural language columns from dataframes for the imputers.
estimator_unable_to_handle_nans – Returns True if the provided estimator class is unable to handle NaN values as input.
generate_component_code – Creates and returns a string that contains the Python imports and code required for running the EvalML component.
get_estimators – Returns the estimators allowed for a particular problem type.
handle_component_class – Standardizes input from a string name to a ComponentBase subclass if necessary.
make_balancing_dictionary – Makes a dictionary for oversampler components. Finds the ratio of each class to the majority class. If a class's ratio is smaller than sampling_ratio, that class is oversampled; otherwise it is not resampled and its data is left as is.
scikit_learn_wrapped_estimator – Wraps an EvalML object as a scikit-learn estimator.
set_boolean_columns_to_categorical – Sets boolean columns to categorical for the imputer.
Contents#
- evalml.pipelines.components.utils.allowed_model_families(problem_type)[source]#
List the model types allowed for a particular problem type.
- Parameters
problem_type (ProblemTypes or str) – ProblemTypes enum or string.
- Returns
A list of model families.
- Return type
list[ModelFamily]
- evalml.pipelines.components.utils.drop_natural_language_columns(X)[source]#
Drops natural language columns from dataframes for the imputers.
- Parameters
X (pd.DataFrame) – The dataframe that we want to impute on.
- Returns
The dataframe with any natural language columns dropped, and a list of all the columns that are considered natural language.
- Return type
pd.DataFrame, list
- evalml.pipelines.components.utils.estimator_unable_to_handle_nans(estimator_class)[source]#
Returns True if the provided estimator class is unable to handle NaN values as input.
- Parameters
estimator_class (Estimator) – Estimator class
- Raises
ValueError – If estimator is not a valid estimator class.
- Returns
True if estimator class is unable to process NaN values, False otherwise.
- Return type
bool
- evalml.pipelines.components.utils.generate_component_code(element)[source]#
Creates and returns a string that contains the Python imports and code required for running the EvalML component.
- Parameters
element (component instance) – The instance of the component to generate string Python code for.
- Returns
String representation of Python code that can be run separately in order to recreate the component instance. Does not include code for custom component implementation.
- Raises
ValueError – If the input element is not a component instance.
Examples
>>> from evalml.pipelines.components.utils import generate_component_code
>>> from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor
>>> assert generate_component_code(DecisionTreeRegressor()) == "from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor\n\ndecisionTreeRegressor = DecisionTreeRegressor(**{'criterion': 'mse', 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0})"
...
>>> from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer
>>> assert generate_component_code(SimpleImputer()) == "from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer\n\nsimpleImputer = SimpleImputer(**{'impute_strategy': 'most_frequent', 'fill_value': None})"
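The strings in the examples above appear to follow a simple pattern: an import line, a blank line, then an assignment whose variable name is the class name with its first letter lowercased. A minimal sketch of that pattern follows. It assumes a hypothetical helper, `render_component_code`, that is not part of EvalML; it takes the module path, class name, and parameters directly and skips the component validation the real function performs.

```python
def render_component_code(module, class_name, parameters):
    """Build an import-plus-constructor string in the style shown above.

    Hypothetical helper for illustration only; not an EvalML function.
    """
    # Variable name: the class name with its first character lowercased.
    variable = class_name[0].lower() + class_name[1:]
    return (
        f"from {module} import {class_name}\n\n"
        f"{variable} = {class_name}(**{parameters})"
    )


code = render_component_code(
    "evalml.pipelines.components.transformers.imputers.simple_imputer",
    "SimpleImputer",
    {"impute_strategy": "most_frequent", "fill_value": None},
)
print(code)
```

Passing the parameters as a `**{...}` dictionary keeps the generated line valid for any set of keyword arguments without formatting each one separately.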
- evalml.pipelines.components.utils.get_estimators(problem_type, model_families=None)[source]#
Returns the estimators allowed for a particular problem type.
Optionally filters the result by a list of model families.
- Parameters
problem_type (ProblemTypes or str) – Problem type to filter for.
model_families (list[ModelFamily] or list[str]) – Model families to filter for.
- Returns
A list of estimator subclasses.
- Return type
list[class]
- Raises
TypeError – If the model_families parameter is not a list.
RuntimeError – If a model family is not valid for the problem type.
- evalml.pipelines.components.utils.handle_component_class(component_class)[source]#
Standardizes input from a string name to a ComponentBase subclass if necessary.
If a str is provided, looks up the ComponentBase class with that name and returns it. If a ComponentBase subclass or component instance is provided, returns it without modification.
- Parameters
component_class (str, ComponentBase) – Input to be standardized.
- Returns
ComponentBase
- Raises
ValueError – If input is not a valid component class.
MissingComponentError – If the component cannot be found.
Examples
>>> from evalml.pipelines.components.utils import handle_component_class
>>> from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor
>>> handle_component_class(DecisionTreeRegressor)
<class 'evalml.pipelines.components.estimators.regressors.decision_tree_regressor.DecisionTreeRegressor'>
>>> handle_component_class("Random Forest Regressor")
<class 'evalml.pipelines.components.estimators.regressors.rf_regressor.RandomForestRegressor'>
- evalml.pipelines.components.utils.make_balancing_dictionary(y, sampling_ratio)[source]#
Makes a dictionary for oversampler components. Finds the ratio of each class to the majority class. If a class's ratio is smaller than sampling_ratio, that class is oversampled; otherwise it is not resampled and its data is left as is.
- Parameters
y (pd.Series) – Target data.
sampling_ratio (float) – The balanced ratio we want the samples to meet.
- Returns
Dictionary where keys are the classes, and the corresponding values are the counts of samples for each class that will satisfy sampling_ratio.
- Return type
dict
- Raises
ValueError – If sampling ratio is not in the range (0, 1] or the target is empty.
Examples
>>> import pandas as pd
>>> from evalml.pipelines.components.utils import make_balancing_dictionary
>>> y = pd.Series([1] * 4 + [2] * 8 + [3])
>>> assert make_balancing_dictionary(y, 0.5) == {2: 8, 1: 4, 3: 4}
>>> assert make_balancing_dictionary(y, 0.9) == {2: 8, 1: 7, 3: 7}
>>> assert make_balancing_dictionary(y, 0.1) == {2: 8, 1: 4, 3: 1}
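The balancing rule described above can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not EvalML's code: the name `balancing_dictionary_sketch` is hypothetical, it takes a plain list instead of a pd.Series, and it assumes the target count is the majority count times `sampling_ratio`, truncated to an integer. That assumption is consistent with the three documented results.

```python
from collections import Counter


def balancing_dictionary_sketch(y, sampling_ratio):
    """Return {class: target sample count} following the rule described above."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in the range (0, 1]")
    if len(y) == 0:
        raise ValueError("target data must not be empty")

    counts = Counter(y)
    majority = max(counts.values())
    # Count that an under-represented class is raised to (assumed truncation).
    target = int(majority * sampling_ratio)

    result = {}
    for label, count in counts.items():
        if count / majority < sampling_ratio:
            result[label] = target  # below the ratio: oversample to the target
        else:
            result[label] = count   # at or above the ratio: leave as is
    return result


y = [1] * 4 + [2] * 8 + [3]
print(balancing_dictionary_sketch(y, 0.9))
```

With `sampling_ratio=0.5`, class 1 sits exactly at the ratio (4/8) and is left untouched, because the comparison is strict.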
- evalml.pipelines.components.utils.scikit_learn_wrapped_estimator(evalml_obj)[source]#
Wraps an EvalML object as a scikit-learn estimator.
- evalml.pipelines.components.utils.set_boolean_columns_to_categorical(X)[source]#
Sets boolean columns to categorical for the imputer.
- Parameters
X (pd.DataFrame) – The dataframe that we want to impute on.
- Returns
The dataframe with any of its Woodwork (ww) boolean columns set to categorical.
- Return type
pd.DataFrame
- class evalml.pipelines.components.utils.WrappedSKClassifier(pipeline)[source]#
Scikit-learn classifier wrapper class.
Methods
fit – Fits component to data.
get_params – Get parameters for this estimator.
predict – Make predictions using selected features.
predict_proba – Make probability estimates for labels.
score – Return the mean accuracy on the given test data and labels.
set_params – Set the parameters of this estimator.
- fit(self, X, y)[source]#
Fits component to data.
- Parameters
X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].
- Returns
self
- get_params(self, deep=True)#
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- predict(self, X)[source]#
Make predictions using selected features.
- Parameters
X (pd.DataFrame) – Features.
- Returns
Predicted values.
- Return type
np.ndarray
- predict_proba(self, X)[source]#
Make probability estimates for labels.
- Parameters
X (pd.DataFrame) – Features.
- Returns
Probability estimates.
- Return type
np.ndarray
- score(self, X, y, sample_weight=None)#
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, a harsh metric because it requires the entire label set of each sample to be predicted correctly.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – Mean accuracy of self.predict(X) with respect to y.
- Return type
float
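To make the mean-accuracy definition concrete, here is a minimal plain-Python sketch for the single-label case. The helper name `mean_accuracy` is hypothetical and this is not the implementation the wrapper uses; it only illustrates how `sample_weight` changes the result.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions that match the true labels."""
    if sample_weight is None:
        # Unweighted case: every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    correct = sum(
        w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p
    )
    return correct / sum(sample_weight)


y_true = ["a", "b", "a", "b"]
y_pred = ["a", "b", "b", "b"]

accuracy = mean_accuracy(y_true, y_pred)
# The misclassified third sample is given double weight here.
weighted = mean_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 1])
print(accuracy, weighted)
```

Weighting the misclassified sample more heavily lowers the score, which is the effect `sample_weight` is meant to have.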
- set_params(self, **params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
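The `<component>__<parameter>` naming convention can be illustrated with a small sketch of how such keys might be routed. The function `group_nested_params` and the parameter names below are hypothetical, chosen only to show the split on the first double underscore; this is not the actual `set_params` implementation.

```python
def group_nested_params(params):
    """Split 'component__parameter' keys into per-component dictionaries."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first "__" only, so deeper nesting stays in `parameter`.
        component, separator, parameter = key.partition("__")
        if separator:
            nested.setdefault(component, {})[parameter] = value
        else:
            own[key] = value  # no "__": a parameter of the object itself
    return own, nested


own, nested = group_nested_params(
    {"verbose": True, "imputer__strategy": "mean", "clf__C": 0.5, "clf__penalty": "l2"}
)
print(own)
print(nested)
```

Each nested dictionary could then be passed on to the matching component, which repeats the same split if its own keys are nested further.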
- class evalml.pipelines.components.utils.WrappedSKRegressor(pipeline)[source]#
Scikit-learn regressor wrapper class.
Methods
fit – Fits component to data.
get_params – Get parameters for this estimator.
predict – Make predictions using selected features.
score – Return the coefficient of determination of the prediction.
set_params – Set the parameters of this estimator.
- fit(self, X, y)[source]#
Fits component to data.
- Parameters
X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].
- Returns
self
- get_params(self, deep=True)#
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- predict(self, X)[source]#
Make predictions using selected features.
- Parameters
X (pd.DataFrame) – Features.
- Returns
Predicted values.
- Return type
np.ndarray
- score(self, X, y, sample_weight=None)#
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – \(R^2\) of self.predict(X) with respect to y.
- Return type
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from scikit-learn version 0.23 to stay consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
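The \(R^2\) definition given for score can be checked by hand. The sketch below computes \(1 - \frac{u}{v}\) in plain Python for a single output. The name `r2_score_sketch` is hypothetical, sample weights and multioutput averaging are left out, and it assumes y_true is not constant (otherwise \(v\) is zero).

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination 1 - u/v for a single output."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares between truth and prediction.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares of the truth around its mean (assumed nonzero).
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - u / v


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_sketch(y_true, y_true)                 # exact predictions
constant = r2_score_sketch(y_true, [6.0] * 4)             # always predicts the mean
reversed_ = r2_score_sketch(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean
print(perfect, constant, reversed_)
```

The three cases match the description: exact predictions score 1.0, always predicting the mean scores 0.0, and a model that does worse than the mean scores below zero.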
- set_params(self, **params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance