utils

Utility methods for EvalML components.

Module Contents

Classes Summary

WrappedSKClassifier

Scikit-learn classifier wrapper class.

WrappedSKRegressor

Scikit-learn regressor wrapper class.

Functions

all_components

Get all available components.

allowed_model_families

List the model types allowed for a particular problem type.

convert_bool_to_double

Converts all boolean columns in dataframe to doubles. If include_ints, converts all integer columns to doubles as well.

estimator_unable_to_handle_nans

Returns True if the provided estimator class is unable to handle NaN values as input.

generate_component_code

Creates and returns a string that contains the Python imports and code required for running the EvalML component.

get_estimators

Returns the estimators allowed for a particular problem type.

get_prediction_intevals_for_tree_regressors

Find the prediction intervals for tree-based regressors.

handle_component_class

Standardizes input from a string name to a ComponentBase subclass if necessary.

handle_float_categories_for_catboost

Updates input data to be compatible with CatBoost estimators.

make_balancing_dictionary

Makes a dictionary for oversampler components. Finds the ratio of each class to the majority class. If the ratio is smaller than sampling_ratio, the class is oversampled; otherwise it is not sampled at all and its data is left as is.

match_indices

Matches index from the passed dataframe to the passed series.

scikit_learn_wrapped_estimator

Wraps an EvalML object as a scikit-learn estimator.

Contents

evalml.pipelines.components.utils.all_components()

Get all available components.

evalml.pipelines.components.utils.allowed_model_families(problem_type)

List the model types allowed for a particular problem type.

Parameters

problem_type (ProblemTypes or str) – ProblemTypes enum or string.

Returns

A list of model families.

Return type

list[ModelFamily]

evalml.pipelines.components.utils.convert_bool_to_double(data: pandas.DataFrame, include_ints: bool = False) → pandas.DataFrame

Converts all boolean columns in dataframe to doubles. If include_ints, converts all integer columns to doubles as well.

Parameters
  • data (pd.DataFrame) – Input dataframe.

  • include_ints (bool) – If True, converts all integer columns to doubles as well. Defaults to False.

Returns

Input dataframe with all boolean-valued columns converted to doubles.

Return type

pd.DataFrame

evalml.pipelines.components.utils.estimator_unable_to_handle_nans(estimator_class)

Returns True if the provided estimator class is unable to handle NaN values as input.

Parameters

estimator_class (Estimator) – Estimator class.

Raises

ValueError – If estimator is not a valid estimator class.

Returns

True if estimator class is unable to process NaN values, False otherwise.

Return type

bool

evalml.pipelines.components.utils.generate_component_code(element)

Creates and returns a string that contains the Python imports and code required for running the EvalML component.

Parameters

element (component instance) – The instance of the component to generate string Python code for.

Returns

String representation of Python code that can be run separately in order to recreate the component instance. Does not include code for custom component implementation.

Raises

ValueError – If the input element is not a component instance.

Examples

>>> from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor
>>> assert generate_component_code(DecisionTreeRegressor()) == "from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor\n\ndecisionTreeRegressor = DecisionTreeRegressor(**{'criterion': 'squared_error', 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0})"
...
>>> from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer
>>> assert generate_component_code(SimpleImputer()) == "from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer\n\nsimpleImputer = SimpleImputer(**{'impute_strategy': 'most_frequent', 'fill_value': None})"
evalml.pipelines.components.utils.get_estimators(problem_type, model_families=None, excluded_model_families=None)

Returns the estimators allowed for a particular problem type.

Can also optionally filter by a list of model types.

Parameters
  • problem_type (ProblemTypes or str) – Problem type to filter for.

  • model_families (list(str, ModelFamily)) – Model families to filter for.

  • excluded_model_families (list(str, ModelFamily)) – A list of model families to exclude from the results.

Returns

A list of estimator subclasses.

Return type

list[class]

Raises
  • TypeError – If the model_families parameter is not a list.

  • RuntimeError – If a model family is not valid for the problem type.
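The filtering and error behaviour described above can be sketched in plain Python. This is an illustration only, not EvalML's implementation: the function name `filter_estimators_sketch`, the `registry` mapping, and the family strings are all hypothetical stand-ins for the real estimator classes and ModelFamily values.

```python
def filter_estimators_sketch(registry, allowed_families,
                             model_families=None, excluded_model_families=None):
    """Filter a hypothetical {estimator_name: family} registry."""
    # Mirrors the documented TypeError for a non-list model_families argument.
    if model_families is not None and not isinstance(model_families, list):
        raise TypeError("model_families must be a list")
    requested = list(allowed_families) if model_families is None else model_families
    # Mirrors the documented RuntimeError for a family the problem type does not allow.
    for family in requested:
        if family not in allowed_families:
            raise RuntimeError(f"Unrecognized model family for this problem type: {family}")
    excluded = set(excluded_model_families or [])
    return [
        name
        for name, family in registry.items()
        if family in requested and family not in excluded
    ]


# Hypothetical registry and allowed families, for illustration only.
registry = {
    "Tree A": "decision_tree",
    "Forest B": "random_forest",
    "Linear C": "linear_model",
}
allowed = ["decision_tree", "random_forest", "linear_model"]
selected = filter_estimators_sketch(
    registry, allowed, excluded_model_families=["linear_model"]
)
```
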

evalml.pipelines.components.utils.get_prediction_intevals_for_tree_regressors(X: pandas.DataFrame, predictions: pandas.Series, coverage: List[float], estimators: List[evalml.pipelines.components.estimators.estimator.Estimator]) → Dict[str, pandas.Series]

Find the prediction intervals for tree-based regressors.

Parameters
  • X (pd.DataFrame) – Data of shape [n_samples, n_features].

  • predictions (pd.Series) – Predictions from the regressor.

  • coverage (list[float]) – A list of floats between 0 and 1 giving the coverage levels for which the upper and lower bounds of the prediction interval should be calculated.

  • estimators (list) – Collection of fitted sub-estimators.

Returns

Prediction intervals, keys are in the format {coverage}_lower or {coverage}_upper.

Return type

dict
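The reference above does not say how the bounds are computed. The sketch below shows one plausible approach, assuming a normal approximation around the ensemble prediction with the spread taken from the sub-estimators' individual predictions. The function name `prediction_intervals_sketch` and its list-based inputs are hypothetical; only the `{coverage}_lower` / `{coverage}_upper` key format is taken from the documentation.

```python
from statistics import NormalDist, pstdev


def prediction_intervals_sketch(predictions, per_estimator_predictions, coverage):
    """Normal-approximation intervals (an assumed method, not EvalML's source).

    per_estimator_predictions[i][j] is sub-estimator i's prediction for sample j.
    """
    # Spread of the sub-estimators' predictions for each sample.
    spreads = [pstdev(sample) for sample in zip(*per_estimator_predictions)]
    intervals = {}
    for level in coverage:
        # Two-sided z-value for the requested coverage level.
        z = NormalDist().inv_cdf((1 + level) / 2)
        intervals[f"{level}_lower"] = [p - z * s for p, s in zip(predictions, spreads)]
        intervals[f"{level}_upper"] = [p + z * s for p, s in zip(predictions, spreads)]
    return intervals


# Two hypothetical sub-estimators, two samples.
intervals = prediction_intervals_sketch(
    predictions=[2.0, 2.0],
    per_estimator_predictions=[[1.0, 2.0], [3.0, 2.0]],
    coverage=[0.95],
)
```

When all sub-estimators agree on a sample, the spread is zero and the lower and upper bounds collapse onto the prediction.
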

evalml.pipelines.components.utils.handle_component_class(component_class)

Standardizes input from a string name to a ComponentBase subclass if necessary.

If a str is provided, will attempt to look up a ComponentBase class by that name and return that class. Otherwise, if a ComponentBase subclass or component instance is provided, will return it without modification.

Parameters

component_class (str, ComponentBase) – Input to be standardized.

Returns

ComponentBase

Raises
  • ValueError – If input is not a valid component class.

  • MissingComponentError – If the component cannot be found.

Examples

>>> from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor
>>> handle_component_class(DecisionTreeRegressor)
<class 'evalml.pipelines.components.estimators.regressors.decision_tree_regressor.DecisionTreeRegressor'>
>>> handle_component_class("Random Forest Regressor")
<class 'evalml.pipelines.components.estimators.regressors.rf_regressor.RandomForestRegressor'>
evalml.pipelines.components.utils.handle_float_categories_for_catboost(X)

Updates input data to be compatible with CatBoost estimators.

CatBoost cannot handle data in X that is the Categorical Woodwork logical type with floating point categories. This utility determines if the floating point categories can be converted to integers without truncating any data, and if they can be, converts them to int64 categories. Will not attempt to use values that are truly floating points.

Parameters

X (pd.DataFrame) – Input data to CatBoost that has Woodwork initialized.

Returns

Input data with exact same Woodwork typing info as the original but with any float categories converted to be int64 when possible.

Return type

DataFrame

Raises

ValueError – If the numeric categories are actual floats that cannot be converted to integers without truncating data.
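The "convert only if nothing is truncated" check can be illustrated on a plain list of category values. This is a minimal sketch of the idea, not the EvalML implementation: `float_categories_to_int` is a hypothetical helper, and it ignores Woodwork typing and the DataFrame structure entirely.

```python
def float_categories_to_int(categories):
    """Convert float category values to ints, refusing any lossy conversion."""
    converted = []
    for value in categories:
        # 2.0 is integer-valued and safe to convert; 2.5 is a true float.
        if not float(value).is_integer():
            raise ValueError(
                f"Category {value!r} cannot be converted to an integer without truncation"
            )
        converted.append(int(value))
    return converted


safe = float_categories_to_int([1.0, 2.0, 3.0])
```
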

evalml.pipelines.components.utils.make_balancing_dictionary(y, sampling_ratio)

Makes a dictionary for oversampler components. Finds the ratio of each class to the majority class. If the ratio is smaller than sampling_ratio, the class is oversampled; otherwise it is not sampled at all and its data is left as is.

Parameters
  • y (pd.Series) – Target data.

  • sampling_ratio (float) – The balanced ratio we want the samples to meet.

Returns

Dictionary where keys are the classes, and the corresponding values are the counts of samples for each class that will satisfy sampling_ratio.

Return type

dict

Raises

ValueError – If sampling ratio is not in the range (0, 1] or the target is empty.

Examples

>>> import pandas as pd
>>> y = pd.Series([1] * 4 + [2] * 8 + [3])
>>> assert make_balancing_dictionary(y, 0.5) == {2: 8, 1: 4, 3: 4}
>>> assert make_balancing_dictionary(y, 0.9) == {2: 8, 1: 7, 3: 7}
>>> assert make_balancing_dictionary(y, 0.1) == {2: 8, 1: 4, 3: 1}
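
The rule behind these results can be written out in a few lines of plain Python. The sketch below is an illustrative re-implementation, not EvalML's source: `balancing_dictionary_sketch` is a hypothetical name, and rounding the target count down is an assumption chosen because it reproduces the three documented examples.

```python
from collections import Counter


def balancing_dictionary_sketch(y, sampling_ratio):
    """Illustrative re-implementation of the documented balancing rule."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in the range (0, 1]")
    if len(y) == 0:
        raise ValueError("target data must not be empty")

    counts = Counter(y)
    majority = max(counts.values())
    targets = {}
    for label, count in counts.items():
        if count / majority < sampling_ratio:
            # Under-represented class: raise its count to sampling_ratio * majority,
            # rounded down (assumed rounding).
            targets[label] = int(majority * sampling_ratio)
        else:
            # Already at or above the requested ratio: leave the count unchanged.
            targets[label] = count
    return targets


y = [1] * 4 + [2] * 8 + [3]
balanced = balancing_dictionary_sketch(y, 0.9)
```

With a ratio of 0.5, class 1 sits exactly at the threshold (4 / 8), so it is left alone, while class 3 (1 / 8) is raised to 4.
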
evalml.pipelines.components.utils.match_indices(X: pandas.DataFrame, y: pandas.Series) → Tuple[pandas.DataFrame, Union[pandas.Series, pandas.DataFrame]]

Matches index from the passed dataframe to the passed series.

Parameters
  • X (pd.DataFrame) – Dataframe to match index from.

  • y (pd.Series) – Series to match the index to.

Returns

DataFrame and Series with matching indices.

Return type

Tuple(pd.DataFrame, pd.Series)

evalml.pipelines.components.utils.scikit_learn_wrapped_estimator(evalml_obj)

Wraps an EvalML object as a scikit-learn estimator.

class evalml.pipelines.components.utils.WrappedSKClassifier(pipeline)

Scikit-learn classifier wrapper class.

Methods

fit

Fits component to data.

get_metadata_routing

Get metadata routing of this object.

get_params

Get parameters for this estimator.

predict

Make predictions using selected features.

predict_proba

Make probability estimates for labels.

score

Return the mean accuracy on the given test data and labels.

set_params

Set the parameters of this estimator.

fit(self, X, y)

Fits component to data.

Parameters
  • X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_metadata_routing(self)

Get metadata routing of this object.

See the scikit-learn User Guide for how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

get_params(self, deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(self, X)

Make predictions using selected features.

Parameters

X (pd.DataFrame) – Features.

Returns

Predicted values.

Return type

np.ndarray

predict_proba(self, X)

Make probability estimates for labels.

Parameters

X (pd.DataFrame) – Features.

Returns

Probability estimates.

Return type

np.ndarray

score(self, X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type

float
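
To see why subset accuracy is described as a harsh metric, the sketch below compares it with plain per-label accuracy on a tiny multi-label example. Both helper functions are hypothetical and written only for this illustration; they are not part of EvalML or scikit-learn.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose entire label set is predicted correctly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual labels predicted correctly, for comparison."""
    pairs = [(a, b) for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    return sum(1 for a, b in pairs if a == b) / len(pairs)


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]  # one wrong label in the second sample

subset = subset_accuracy(y_true, y_pred)
per_label = per_label_accuracy(y_true, y_pred)
```

A single wrong label out of nine costs the whole second sample, so the subset accuracy drops to 2/3 while the per-label accuracy stays at 8/9.
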

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance
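
The `<component>__<parameter>` naming convention can be illustrated by splitting a parameter dictionary on the first double underscore. This is a simplified sketch of the convention only; `split_nested_params` and the parameter names are hypothetical, and the real `set_params` also validates names and applies the values to the nested objects.

```python
from collections import defaultdict


def split_nested_params(params):
    """Separate top-level parameters from <component>__<parameter> entries."""
    own = {}
    nested = defaultdict(dict)
    for key, value in params.items():
        # Split on the first "__" only, so deeper nesting is passed down intact.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested[name][sub_key] = value
        else:
            own[key] = value
    return own, dict(nested)


own, nested = split_nested_params(
    {
        "verbose": True,
        "imputer__strategy": "mean",
        "clf__max_depth": 6,
        "clf__tree__criterion": "gini",
    }
)
```
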

class evalml.pipelines.components.utils.WrappedSKRegressor(pipeline)

Scikit-learn regressor wrapper class.

Methods

fit

Fits component to data.

get_metadata_routing

Get metadata routing of this object.

get_params

Get parameters for this estimator.

predict

Make predictions using selected features.

score

Return the coefficient of determination of the prediction.

set_params

Set the parameters of this estimator.

fit(self, X, y)

Fits component to data.

Parameters
  • X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

get_metadata_routing(self)

Get metadata routing of this object.

See the scikit-learn User Guide for how the routing mechanism works.

Returns

routing – A MetadataRequest encapsulating routing information.

Return type

MetadataRequest

get_params(self, deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(self, X)

Make predictions using selected features.

Parameters

X (pd.DataFrame) – Features.

Returns

Predicted values.

Return type

np.ndarray

score(self, X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – \(R^2\) of self.predict(X) w.r.t. y.

Return type

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
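
The \(R^2\) definition given for score can be checked by hand. The sketch below computes \(1 - \frac{u}{v}\) directly for a single output without sample weights; `r2_score_sketch` is a hypothetical helper written for this illustration, not the function the wrapper calls.

```python
def r2_score_sketch(y_true, y_pred):
    """Compute R^2 = 1 - u / v for one output, without sample weights."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares (zero if y_true is constant, which this sketch does not handle).
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - u / v


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

model_score = r2_score_sketch(y_true, y_pred)

# A constant model that always predicts the mean of y_true scores 0.0.
mean_value = sum(y_true) / len(y_true)
baseline_score = r2_score_sketch(y_true, [mean_value] * len(y_true))
```

Here \(u = 1.5\) and \(v = 29.1875\), so the model scores about 0.949, while the constant baseline has \(u = v\) and scores 0.0.
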

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance