preprocessing

Preprocessing transformer components.

Package Contents

Classes Summary

DateTimeFeaturizer

Transformer that can automatically extract features from datetime columns.

Decomposer

Component that removes trends and seasonality from time series and returns the decomposed components.

DFSTransformer

Featuretools DFS component that generates features for the input features.

DropNaNRowsTransformer

Transformer to drop rows with NaN values.

DropNullColumns

Transformer to drop features whose percentage of NaN values exceeds a specified threshold.

DropRowsTransformer

Transformer to drop rows specified by row indices.

EmailFeaturizer

Transformer that can automatically extract features from emails.

LogTransformer

Applies a log transformation to the target data.

LSA

Transformer to calculate the Latent Semantic Analysis values of text input.

NaturalLanguageFeaturizer

Transformer that can automatically featurize text columns using featuretools' nlp_primitives.

PolynomialDecomposer

Removes trends and seasonality from time series by fitting a polynomial and moving average to the data.

ReplaceNullableTypes

Transformer to replace features that use the new nullable dtypes with a dtype that is compatible with EvalML.

STLDecomposer

Removes trends and seasonality from time series using the STL algorithm.

TextTransformer

Base class for all transformers working with text features.

TimeSeriesFeaturizer

Transformer that delays input features and target variable for time series problems.

TimeSeriesRegularizer

Transformer that regularizes an inconsistently spaced datetime column.

URLFeaturizer

Transformer that can automatically extract features from URLs.

Contents

class evalml.pipelines.components.transformers.preprocessing.DateTimeFeaturizer(features_to_extract=None, encode_as_categories=False, time_index=None, random_seed=0, **kwargs)

Transformer that can automatically extract features from datetime columns.

Parameters
  • features_to_extract (list) – List of features to extract. Valid options include “year”, “month”, “day_of_week”, “hour”. Defaults to None.

  • encode_as_categories (bool) – Whether day-of-week and month features should be encoded as pandas “category” dtype. This allows OneHotEncoders to encode these features. Defaults to False.

  • time_index (str) – Name of the column containing the datetime information used to order the data. Ignored.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

False

name

DateTime Featurizer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fit the datetime featurizer component.

fit_transform

Fits on X and transforms X.

get_feature_names

Gets the categories of each datetime feature.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data X by creating new features using existing DateTime columns, and then dropping those DateTime columns.

clone(self)

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict
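
The convention above can be illustrated with a toy class. This is a minimal sketch, not EvalML's component base class: `ToyComponent` and its `threshold` parameter are hypothetical, and `default_parameters` is written as a plain classmethod for simplicity.

```python
class ToyComponent:
    """Hypothetical stand-in for a component; not part of EvalML."""

    def __init__(self, threshold=0.5, random_seed=0):
        # Record only the arguments that configure the component's behavior.
        self._parameters = {"threshold": threshold}
        self.random_seed = random_seed

    @property
    def parameters(self):
        # Return a copy so callers cannot mutate the stored parameters.
        return dict(self._parameters)

    @classmethod
    def default_parameters(cls):
        # Defaults are whatever a no-argument instance reports.
        return cls().parameters


defaults = ToyComponent.default_parameters()
custom = ToyComponent(threshold=0.9).parameters
```

Here `defaults` equals `ToyComponent().parameters`, while `custom` differs because a non-default argument was passed.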

describe(self, print_name=False, return_dict=False)

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)

Fit the datetime featurizer component.

Parameters
  • X (pd.DataFrame) – Input features.

  • y (pd.Series, optional) – Target data. Ignored.

Returns

self

fit_transform(self, X, y=None)

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

get_feature_names(self)

Gets the categories of each datetime feature.

Returns

Dictionary, where each key-value pair is a column name and a dictionary mapping the unique feature values to their integer encoding.

Return type

dict

static load(file_path)

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)

Transforms data X by creating new features using existing DateTime columns, and then dropping those DateTime columns.

Parameters
  • X (pd.DataFrame) – Input features.

  • y (pd.Series, optional) – Ignored.

Returns

Transformed X

Return type

pd.DataFrame
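
The idea of replacing a datetime column with derived features can be sketched with the standard library alone. This is an illustration under stated assumptions, not EvalML's implementation: the function name, the `<column>_<feature>` naming scheme, and the row-as-dict layout are hypothetical, and the day-of-week integers follow Python's `datetime.weekday()` (Monday is 0), which may differ from the encoding the component uses.

```python
from datetime import datetime


def extract_datetime_features(rows, column, features=("year", "month", "day_of_week", "hour")):
    """Add one derived field per requested feature and drop the datetime field."""
    extractors = {
        "year": lambda d: d.year,
        "month": lambda d: d.month,
        "day_of_week": lambda d: d.weekday(),  # stdlib convention: Monday == 0
        "hour": lambda d: d.hour,
    }
    featurized = []
    for row in rows:
        # Copy every field except the datetime column, which is dropped.
        new_row = {key: value for key, value in row.items() if key != column}
        for name in features:
            new_row[f"{column}_{name}"] = extractors[name](row[column])
        featurized.append(new_row)
    return featurized


rows = [{"ts": datetime(2021, 3, 15, 9, 30), "value": 1.0}]
featurized = extract_datetime_features(rows, "ts")
```

The single input row ends up with `ts_year`, `ts_month`, `ts_day_of_week`, and `ts_hour` fields alongside the untouched `value` field, and no `ts` field.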

class evalml.pipelines.components.transformers.preprocessing.Decomposer(component_obj=None, random_seed: int = 0, degree: int = 1, seasonal_period: int = -1, time_index: str = None, **kwargs)

Component that removes trends and seasonality from time series and returns the decomposed components.

Parameters
  • parameters (dict) – Dictionary of parameters to pass to component object.

  • component_obj (class) – Instance of a detrender/deseasonalizer class.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

  • degree (int) – Degree of the polynomial used by PolynomialDecomposer; currently not used by STLDecomposer. Defaults to 1.

  • seasonal_period (int) – The best guess, in number of time series entries, for the period of the seasonal signal. Defaults to -1.

  • time_index (str) – The column name of the feature matrix (X) that the datetime information should be pulled from.

Attributes

hyperparameter_ranges

None

invalid_frequencies

[]

modifies_features

False

modifies_target

True

name

Decomposer

needs_fitting

True

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

determine_periodicity

Function that uses autocorrelative methods to determine the first significant period of the seasonal signal.

fit

Fits component to data.

fit_transform

Removes fitted trend and seasonality from target variable.

get_trend_dataframe

Return a list of dataframes, each with 3 columns: trend, seasonality, residual.

inverse_transform

Add the trend + seasonality back to y.

is_freq_valid

Determines if the given string represents a valid frequency for this decomposer.

load

Loads component at file path.

parameters

Returns the parameters which were used to initialize the component.

plot_decomposition

Plots the decomposition of the target signal.

save

Saves component at file path.

set_seasonal_period

Function to set the component's seasonal period based on the target's seasonality.

transform

Transforms data X.

clone(self)

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

determine_periodicity(self, X: pandas.DataFrame, y: pandas.Series, method: str = 'autocorrelation')

Function that uses autocorrelative methods to determine the first significant period of the seasonal signal.

Parameters
  • X (pandas.DataFrame) – The feature data of the time series problem.

  • y (pandas.Series) – The target data of a time series problem.

  • method (str) – Either “autocorrelation” or “partial-autocorrelation”. The method by which to determine the first period of the seasonal part of the target signal. “partial-autocorrelation” should currently not be used. Defaults to “autocorrelation”.

Returns

The integer number of entries in the time series data over which the seasonal part of the target data repeats. If the time series data is in days, then this is the number of days that it takes the target’s seasonal signal to repeat. Note: the target data can contain multiple seasonal signals. This function will only return the first, and thus shortest, period. E.g. if the target has both weekly and yearly seasonality, the function will only return “7” and not return “365”. If no period is detected, returns [None].

Return type

(list[int])
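
The idea of reading the shortest seasonal period off the autocorrelation can be sketched as follows. This is a simplified illustration, not the method EvalML uses internally: the function names, the 0.5 significance threshold, and the "first local peak" rule are all assumptions made for the example.

```python
import math


def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag; 0.0 for a constant series."""
    n = len(y)
    mean = sum(y) / n
    variance = sum((v - mean) ** 2 for v in y)
    if variance == 0:
        return 0.0
    covariance = sum((y[t] - mean) * (y[t + lag] - mean) for t in range(n - lag))
    return covariance / variance


def first_period(y, threshold=0.5):
    """Return the smallest lag whose autocorrelation is a local peak above threshold, or None."""
    max_lag = len(y) // 2
    acf = [autocorrelation(y, lag) for lag in range(max_lag + 1)]
    for lag in range(1, max_lag):
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] > acf[lag + 1]
        if is_peak and acf[lag] > threshold:
            return lag
    return None


# Ten weeks of daily observations with a weekly cycle.
daily = [math.sin(2 * math.pi * t / 7) for t in range(70)]
weekly_period = first_period(daily)
no_period = first_period([1.0] * 70)
```

For the weekly sine wave the first qualifying peak sits at lag 7, and the constant series yields `None` because it has no peaks at all. Scanning lags in increasing order is what makes the sketch return the shortest period when several are present.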

fit(self, X, y=None)

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features]

  • y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X: pandas.DataFrame, y: pandas.Series = None) -> tuple[pandas.DataFrame, pandas.Series]

Removes fitted trend and seasonality from target variable.

Parameters
  • X (pd.DataFrame, optional) – Ignored.

  • y (pd.Series) – Target variable to detrend and deseasonalize.

Returns

The first element is the input features, returned without modification.

The second element is the target variable y with the fitted trend and seasonality removed.

Return type

tuple of pd.DataFrame, pd.Series

abstract get_trend_dataframe(self, y: pandas.Series)

Return a list of dataframes, each with 3 columns: trend, seasonality, residual.

abstract inverse_transform(self, y: pandas.Series)

Add the trend + seasonality back to y.
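
The relationship between removing the trend and seasonality and adding them back can be sketched in plain Python. This is an illustration only, assuming a linear trend fitted by least squares and a seasonal component taken as the per-position mean of the detrended values. It is not the fitting method of any EvalML decomposer, and all function names are hypothetical.

```python
def fit_linear_trend(y):
    """Least-squares line through (0, y[0]), (1, y[1]), ...; returns (slope, intercept)."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    covariance = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    return slope, y_mean - slope * t_mean


def decompose(y, period):
    """Split y into trend, seasonality, and residual lists of equal length."""
    slope, intercept = fit_linear_trend(y)
    trend = [intercept + slope * t for t in range(len(y))]
    detrended = [v - tr for v, tr in zip(y, trend)]
    # Seasonal value for each position in the cycle: mean of the detrended values there.
    seasonal_means = [
        sum(detrended[i::period]) / len(detrended[i::period]) for i in range(period)
    ]
    seasonality = [seasonal_means[t % period] for t in range(len(y))]
    residual = [d - s for d, s in zip(detrended, seasonality)]
    return trend, seasonality, residual


def recompose(residual, trend, seasonality):
    """Inverse step: add the trend and seasonality back to the residual."""
    return [r + t + s for r, t, s in zip(residual, trend, seasonality)]


y = [2.0 * t + [0.0, 3.0, -3.0][t % 3] for t in range(12)]
trend, seasonality, residual = decompose(y, period=3)
restored = recompose(residual, trend, seasonality)
```

Because the three components sum to the original series by construction, `restored` matches `y` up to floating-point error. The same three pieces correspond to the trend, seasonality, and residual columns described for `get_trend_dataframe`.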

classmethod is_freq_valid(self, freq: str)

Determines if the given string represents a valid frequency for this decomposer.

Parameters

freq (str) – A frequency to validate. See the pandas docs at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for options.

Returns

boolean representing whether the frequency is valid or not.

static load(file_path)

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

property parameters(self)

Returns the parameters which were used to initialize the component.

plot_decomposition(self, X: pandas.DataFrame, y: pandas.Series, show: bool = False) -> tuple[matplotlib.pyplot.Figure, list]

Plots the decomposition of the target signal.

Parameters
  • X (pd.DataFrame) – Input data with time series data in index.

  • y (pd.Series or pd.DataFrame) – Target variable data provided as a Series for univariate problems or a DataFrame for multivariate problems.

  • show (bool) – Whether to display the plot or not. Defaults to False.

Returns

The figure and axes that have the decompositions plotted on them.

Return type

matplotlib.pyplot.Figure, list[matplotlib.pyplot.Axes]

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

set_seasonal_period(self, X: pandas.DataFrame, y: pandas.Series)

Function to set the component’s seasonal period based on the target’s seasonality.

Parameters
  • X (pandas.DataFrame) – The feature data of the time series problem.

  • y (pandas.Series) – The target data of a time series problem.

abstract transform(self, X, y=None)

Transforms data X.

Parameters
  • X (pd.DataFrame) – Data to transform.

  • y (pd.Series, optional) – Target data.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

class evalml.pipelines.components.transformers.preprocessing.DFSTransformer(index='index', features=None, random_seed=0, **kwargs)

Featuretools DFS component that generates features for the input features.

Parameters
  • index (string) – The name of the column that contains the indices. If no column with this name exists, then featuretools.EntitySet() creates a column with this name to serve as the index column. Defaults to ‘index’.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

  • features (list) – List of features to run DFS on. Defaults to None. Features will only be computed if the columns used by the feature exist in the input and if the feature itself is not in input.

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

False

name

DFS Transformer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits the DFSTransformer component.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Computes the feature matrix for the input X using featuretools' dfs algorithm.

clone(self)

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)

Fits the DFSTransformer component.

Parameters
  • X (pd.DataFrame, np.array) – The input data to transform, of shape [n_samples, n_features].

  • y (pd.Series) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y=None)

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)

Computes the feature matrix for the input X using featuretools’ dfs algorithm.

Parameters
  • X (pd.DataFrame or np.ndarray) – The input training data to transform. Has shape [n_samples, n_features]

  • y (pd.Series, optional) – Ignored.

Returns

Feature matrix

Return type

pd.DataFrame

class evalml.pipelines.components.transformers.preprocessing.DropNaNRowsTransformer(parameters=None, component_obj=None, random_seed=0, **kwargs)

Transformer to drop rows with NaN values.

Parameters

random_seed (int) – Seed for the random number generator. Is not used by this component. Defaults to 0.

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

True

name

Drop NaN Rows Transformer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data using fitted component.

clone(self)

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y=None)

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)

Transforms data using fitted component.

Parameters
  • X (pd.DataFrame) – Features.

  • y (pd.Series, optional) – Target data.

Returns

Data with NaN rows dropped.

Return type

(pd.DataFrame, pd.Series)
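
Because this transformer modifies both the features and the target, rows have to be removed from the two together so they stay aligned. The sketch below shows that idea with plain Python containers. It assumes a row is dropped when either its features or its target value is NaN; the function names and the list-of-dicts layout are hypothetical and not EvalML's implementation.

```python
import math


def is_nan(value):
    # Only floats can be NaN; any other value is treated as present.
    return isinstance(value, float) and math.isnan(value)


def drop_nan_rows(X_rows, y=None):
    """Drop every row where any feature, or the matching target value, is NaN."""
    kept_X, kept_y = [], []
    for i, row in enumerate(X_rows):
        target = y[i] if y is not None else None
        if any(is_nan(v) for v in row.values()) or is_nan(target):
            continue
        kept_X.append(row)
        kept_y.append(target)
    return kept_X, (kept_y if y is not None else None)


X_rows = [
    {"a": 1.0, "b": 2.0},
    {"a": float("nan"), "b": 3.0},
    {"a": 4.0, "b": 5.0},
]
y = [10.0, 20.0, float("nan")]
X_clean, y_clean = drop_nan_rows(X_rows, y)
```

Only the first row survives here: the second has a NaN feature and the third has a NaN target. Calling `drop_nan_rows(X_rows)` without a target keeps the first and third rows and returns `None` for the target.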

class evalml.pipelines.components.transformers.preprocessing.DropNullColumns(pct_null_threshold=1.0, random_seed=0, **kwargs)

Transformer to drop features whose percentage of NaN values exceeds a specified threshold.

Parameters
  • pct_null_threshold (float) – The percentage of NaN values that causes an input feature to be dropped. Must be a value in the range [0, 1] inclusive. If equal to 0.0, will drop columns with any null values. If equal to 1.0, will drop columns with all null values. Defaults to 1.0.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.
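
The threshold semantics described for `pct_null_threshold` can be sketched in plain Python. This is an illustration, not EvalML's code: the function name and the dict-of-lists layout are hypothetical, `None` stands in for a missing value, and the "meets or exceeds" comparison for thresholds above 0.0 is an assumption chosen because it reproduces both documented edge cases. The default of 1.0 is taken from the class signature.

```python
def columns_to_drop(columns, pct_null_threshold=1.0):
    """Return names of columns whose fraction of missing values meets the threshold."""
    if not 0.0 <= pct_null_threshold <= 1.0:
        raise ValueError("pct_null_threshold must be between 0 and 1 inclusive")
    dropped = []
    for name, values in columns.items():
        # Assumes each column holds at least one value.
        null_fraction = sum(v is None for v in values) / len(values)
        if pct_null_threshold == 0.0:
            # Documented edge case: any missing value is enough to drop the column.
            should_drop = null_fraction > 0.0
        else:
            # Assumption: "meets or exceeds", which reproduces the documented 1.0 case.
            should_drop = null_fraction >= pct_null_threshold
        if should_drop:
            dropped.append(name)
    return dropped


columns = {
    "full": [1, 2, 3, 4],
    "half": [1, None, 3, None],
    "empty": [None, None, None, None],
}
only_all_null = columns_to_drop(columns, pct_null_threshold=1.0)
any_null = columns_to_drop(columns, pct_null_threshold=0.0)
```

With a threshold of 1.0 only the entirely missing column is selected, while 0.0 selects every column that has at least one missing value.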

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

False

name

Drop Null Columns Transformer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data X by dropping columns that exceed the threshold of null values.

clone(self)

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y=None)

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)

Transforms data X by dropping columns that exceed the threshold of null values.

Parameters
  • X (pd.DataFrame) – Data to transform

  • y (pd.Series, optional) – Ignored.

Returns

Transformed X

Return type

pd.DataFrame

class evalml.pipelines.components.transformers.preprocessing.DropRowsTransformer(indices_to_drop=None, random_seed=0)

Transformer to drop rows specified by row indices.

Parameters
  • indices_to_drop (list) – List of indices to drop in the input data. Defaults to None.

  • random_seed (int) – Seed for the random number generator. Is not used by this component. Defaults to 0.

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

True

name

Drop Rows Transformer

training_only

True

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data using fitted component.

clone(self)

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

Raises

ValueError – If indices to drop do not exist in input features or target.

fit_transform(self, X, y=None)

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)

Transforms data using fitted component.

Parameters
  • X (pd.DataFrame) – Features.

  • y (pd.Series, optional) – Target data.

Returns

Data with row indices dropped.

Return type

(pd.DataFrame, pd.Series)
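
The split between validating indices at fit time and removing rows at transform time can be sketched as below. This is a simplified illustration, not EvalML's implementation: the function names are hypothetical, and `X` and `y` are modeled as dicts keyed by row index rather than as pandas objects.

```python
def fit_drop_rows(index, indices_to_drop):
    """Check that every index to drop is present, mirroring the documented ValueError."""
    present = set(index)
    missing = [i for i in (indices_to_drop or []) if i not in present]
    if missing:
        raise ValueError(f"Indices {missing} do not exist in the input data")


def transform_drop_rows(X, y, indices_to_drop):
    """Return copies of X and y without the rows whose index is in indices_to_drop."""
    to_drop = set(indices_to_drop or [])
    X_t = {i: row for i, row in X.items() if i not in to_drop}
    y_t = {i: v for i, v in y.items() if i not in to_drop} if y is not None else None
    return X_t, y_t


X = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 3}}
y = {0: 10, 1: 20, 2: 30}
fit_drop_rows(X.keys(), [1])
X_t, y_t = transform_drop_rows(X, y, [1])
```

Row 1 is removed from both the features and the target, and passing an index such as 5 to `fit_drop_rows` raises `ValueError` because that row does not exist.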

class evalml.pipelines.components.transformers.preprocessing.EmailFeaturizer(random_seed=0, **kwargs)

Transformer that can automatically extract features from emails.

Parameters

random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

False

name

Email Featurizer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data X.

clone(self)

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features]

  • y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X, y=None)

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)

Transforms data X.

Parameters
  • X (pd.DataFrame) – Data to transform.

  • y (pd.Series, optional) – Target data.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

class evalml.pipelines.components.transformers.preprocessing.LogTransformer(random_seed=0)

Applies a log transformation to the target data.
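
A log transform of the target and its inverse can be sketched with the standard library. This is a minimal illustration that assumes strictly positive target values and the natural logarithm; the function names are hypothetical, and the sketch says nothing about how the component treats zero or negative targets.

```python
import math


def log_transform(y):
    """Natural log of each target value; assumes all values are strictly positive."""
    return [math.log(v) for v in y]


def inverse_log_transform(y_t):
    """Undo log_transform by exponentiating each value."""
    return [math.exp(v) for v in y_t]


y = [1.0, 10.0, 100.0]
y_t = log_transform(y)
y_back = inverse_log_transform(y_t)
```

Exponentiating the transformed values recovers the original targets up to floating-point error, which is the role `inverse_transform` plays when predictions made on the log scale are mapped back.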

Attributes

hyperparameter_ranges

{}

modifies_features

False

modifies_target

True

name

Log Transformer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits the LogTransformer.

fit_transform

Log transforms the target variable.

inverse_transform

Apply exponential to target data.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Log transforms the target variable.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the LogTransformer.

Parameters
  • X (pd.DataFrame or np.ndarray) – Ignored.

  • y (pd.Series, optional) – Ignored.

Returns

self

fit_transform(self, X, y=None)[source]#

Log transforms the target variable.

Parameters
  • X (pd.DataFrame, optional) – Ignored.

  • y (pd.Series) – Target variable to log transform.

Returns

The input features are returned without modification. The target variable y is log transformed.

Return type

tuple of pd.DataFrame, pd.Series

inverse_transform(self, y)[source]#

Apply exponential to target data.

Parameters

y (pd.Series) – Target variable.

Returns

Target with exponential applied.

Return type

pd.Series
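The round trip between transform and inverse_transform can be sketched in plain Python. This is an illustration only: it assumes a natural logarithm and strictly positive target values, and the helper names log_transform and inverse_log_transform are hypothetical, not part of EvalML.

```python
import math


def log_transform(y):
    """Return the natural log of each target value (values must be > 0)."""
    return [math.log(v) for v in y]


def inverse_log_transform(y_t):
    """Undo log_transform by applying the exponential."""
    return [math.exp(v) for v in y_t]


y = [1.0, 10.0, 100.0]
y_t = log_transform(y)
y_back = inverse_log_transform(y_t)

# The round trip recovers the original targets up to floating-point error.
round_trip_ok = all(math.isclose(a, b) for a, b in zip(y, y_back))
```

Because the logarithm is undefined for zero and negative values, a target containing such values would need separate handling before a transform of this kind.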

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Log transforms the target variable.

Parameters
  • X (pd.DataFrame, optional) – Ignored.

  • y (pd.Series) – Target data to log transform.

Returns

The input features are returned without modification. The target variable y is log transformed.

Return type

tuple of pd.DataFrame, pd.Series

class evalml.pipelines.components.transformers.preprocessing.LSA(random_seed=0, **kwargs)[source]#

Transformer to calculate the Latent Semantic Analysis Values of text input.

Parameters

random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

False

name

LSA Transformer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits the input data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data X by applying the LSA pipeline.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the input data.

Parameters
  • X (pd.DataFrame) – The data to transform.

  • y (pd.Series, optional) – Ignored.

Returns

self

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms data X by applying the LSA pipeline.

Parameters
  • X (pd.DataFrame) – The data to transform.

  • y (pd.Series, optional) – Ignored.

Returns

Transformed X. The original column is removed and replaced with two columns of the format LSA(original_column_name)[feature_number], where feature_number is 0 or 1.

Return type

pd.DataFrame
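The output column naming described above can be sketched as follows. The helper lsa_column_names and the column name review_text are hypothetical and used only for illustration; the sketch reproduces the documented name format, not the LSA computation itself.

```python
def lsa_column_names(original_column_name, n_features=2):
    """Build the documented LSA output names for one text column."""
    return [f"LSA({original_column_name})[{i}]" for i in range(n_features)]


names = lsa_column_names("review_text")
print(names)  # ['LSA(review_text)[0]', 'LSA(review_text)[1]']
```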

class evalml.pipelines.components.transformers.preprocessing.NaturalLanguageFeaturizer(random_seed=0, **kwargs)[source]#

Transformer that can automatically featurize text columns using featuretools’ nlp_primitives.

Since models cannot handle non-numeric data, any text must be broken down into features that provide useful information about that text. This component splits each text column into several informative features: Diversity Score, Mean Characters per Word, Polarity Score, LSA (Latent Semantic Analysis), Number of Characters, and Number of Words. Calling transform on this component will replace any text columns in the given dataset with these numeric columns.

Parameters

random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

False

name

Natural Language Featurizer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data X by creating new features using existing text columns.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits component to data.

Parameters
  • X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]

  • y (pd.Series) – The target training data of length [n_samples]

Returns

self

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms data X by creating new features using existing text columns.

Parameters
  • X (pd.DataFrame) – The data to transform.

  • y (pd.Series, optional) – Ignored.

Returns

Transformed X

Return type

pd.DataFrame

class evalml.pipelines.components.transformers.preprocessing.PolynomialDecomposer(time_index: str = None, degree: int = 1, seasonal_period: int = -1, random_seed: int = 0, **kwargs)[source]#

Removes trends and seasonality from time series by fitting a polynomial and moving average to the data.

sktime’s PolynomialTrendForecaster is used to generate the additive trend portion of the target data. A polynomial will be fit to the data during fit. That additive polynomial trend will be removed during fit so that statsmodels’ seasonal_decompose can determine the additive seasonality of the data by using rolling averages over the series’ inferred periodicity.

For example, daily time series data will generate rolling averages over the first week of data, normalize out the mean and return those 7 averages repeated over the entire length of the given series. Those seven averages, repeated as many times as necessary to match the length of the given target data, will be used as the seasonal signal of the data.
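The additive decomposition described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not EvalML's implementation: the trend is a degree-1 least-squares fit, and the seasonal profile is a per-position average over all periods rather than the rolling averages used by seasonal_decompose. All function names here are hypothetical.

```python
from statistics import mean


def fit_linear_trend(y):
    """Ordinary least-squares line through (t, y[t]) for t = 0..n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = mean(y)
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    return slope, y_mean - slope * t_mean


def seasonal_profile(detrended, period):
    """Average each position within the period, then center to zero mean."""
    averages = [mean(detrended[i::period]) for i in range(period)]
    offset = mean(averages)
    return [a - offset for a in averages]


def decompose(y, period):
    """Split y into additive trend, seasonal, and residual parts."""
    slope, intercept = fit_linear_trend(y)
    trend = [intercept + slope * t for t in range(len(y))]
    detrended = [v - tr for v, tr in zip(y, trend)]
    profile = seasonal_profile(detrended, period)
    # Repeat the single-period profile to cover the whole series.
    seasonal = [profile[t % period] for t in range(len(y))]
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual


# Four weeks of synthetic daily data: linear trend plus a zero-mean weekly pattern.
pattern = [3.0, 1.0, -1.0, -3.0, -2.0, 0.0, 2.0]
y = [2.0 + 0.5 * t + pattern[t % 7] for t in range(28)]
trend, seasonal, residual = decompose(y, period=7)
```

By construction, the three parts sum back to the original series, and the seasonal part is the same seven values repeated across the series.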

Parameters
  • time_index (str) – Specifies the name of the column in X that provides the datetime objects. Defaults to None.

  • degree (int) – Degree for the polynomial. If 1, linear model is fit to the data. If 2, quadratic model is fit, etc. Defaults to 1.

  • seasonal_period (int) – The number of entries in the time series data that corresponds to one period of a cyclic signal. For instance, if data is known to possess a weekly seasonal signal, and if the data is daily data, seasonal_period should be 7. For daily data with a yearly seasonal signal, seasonal_period should be 365. Defaults to -1, which uses the statsmodels library’s freq_to_period function.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

hyperparameter_ranges

{ “degree”: Integer(1, 3)}

invalid_frequencies

[]

modifies_features

False

modifies_target

True

name

Polynomial Decomposer

needs_fitting

True

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

determine_periodicity

Function that uses autocorrelative methods to determine the first significant period of the seasonal signal.

fit

Fits the PolynomialDecomposer and determines the seasonal signal.

fit_transform

Removes fitted trend and seasonality from target variable.

get_trend_dataframe

Return a list of dataframes with 4 columns: signal, trend, seasonality, residual.

inverse_transform

Adds back fitted trend and seasonality to target variable.

is_freq_valid

Determines if the given string represents a valid frequency for this decomposer.

load

Loads component at file path.

parameters

Returns the parameters which were used to initialize the component.

plot_decomposition

Plots the decomposition of the target signal.

save

Saves component at file path.

set_seasonal_period

Function to set the component's seasonal period based on the target's seasonality.

transform

Transforms the target data by removing the polynomial trend and rolling average seasonality.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

determine_periodicity(self, X: pandas.DataFrame, y: pandas.Series, method: str = 'autocorrelation')#

Function that uses autocorrelative methods to determine the first significant period of the seasonal signal.

Parameters
  • X (pandas.DataFrame) – The feature data of the time series problem.

  • y (pandas.Series) – The target data of a time series problem.

  • method (str) – Either “autocorrelation” or “partial-autocorrelation”. The method by which to determine the first period of the seasonal part of the target signal. “partial-autocorrelation” should currently not be used. Defaults to “autocorrelation”.

Returns

The integer number of entries in the time series data over which the seasonal part of the target data repeats. If the time series data is in days, then this is the number of days that it takes the target’s seasonal signal to repeat. Note: the target data can contain multiple seasonal signals. This function will only return the first, and thus shortest, period. E.g. if the target has both weekly and yearly seasonality, the function will only return “7” and not return “365”. If no period is detected, returns [None].

Return type

(list[int])
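One way to picture the autocorrelation approach is the plain-Python sketch below. It is an assumption-laden illustration rather than EvalML's implementation: it takes the first lag at which the sample autocorrelation has a local maximum above a fixed threshold. The function names and the 0.3 threshold are hypothetical choices.

```python
import math


def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    m = sum(y) / n
    denom = sum((v - m) ** 2 for v in y)
    if denom == 0:
        return 0.0  # constant series: no correlation structure
    num = sum((y[t] - m) * (y[t + lag] - m) for t in range(n - lag))
    return num / denom


def first_period(y, threshold=0.3):
    """Return the first lag that is a local autocorrelation peak, or None."""
    max_lag = len(y) // 2
    acf = [autocorrelation(y, k) for k in range(max_lag + 1)]
    for k in range(1, max_lag):
        if acf[k] > threshold and acf[k] > acf[k - 1] and acf[k] > acf[k + 1]:
            return k
    return None


# Ten cycles of a signal that repeats every 7 entries.
weekly = [math.sin(2 * math.pi * t / 7) for t in range(70)]
detected = first_period(weekly)
flat = first_period([1.0] * 20)
```

Only the shortest period is reported, which matches the note above about series with several seasonal signals.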

fit(self, X: pandas.DataFrame, y: pandas.Series = None) → PolynomialDecomposer[source]#

Fits the PolynomialDecomposer and determines the seasonal signal.

Currently only fits the polynomial detrender. The seasonality is determined by removing the trend from the signal and using statsmodels’ seasonal_decompose(). Both the trend and seasonality are currently assumed to be additive.

Parameters
  • X (pd.DataFrame, optional) – Conditionally used to build datetime index.

  • y (pd.Series) – Target variable to detrend and deseasonalize.

Returns

self

Raises
  • NotImplementedError – If the input data has a frequency of “month-begin”. This isn’t supported by statsmodels decompose as the freqstr “MS” is misinterpreted as milliseconds.

  • ValueError – If y is None.

  • ValueError – If the target data doesn’t have a DatetimeIndex and the features data contains no Datetime features.

fit_transform(self, X: pandas.DataFrame, y: pandas.Series = None) → tuple[pandas.DataFrame, pandas.Series]#

Removes fitted trend and seasonality from target variable.

Parameters
  • X (pd.DataFrame, optional) – Ignored.

  • y (pd.Series) – Target variable to detrend and deseasonalize.

Returns

The first element is the input features, returned without modification. The second element is the target variable y with the fitted trend removed.

Return type

tuple of pd.DataFrame, pd.Series

get_trend_dataframe(self, X: pandas.DataFrame, y: pandas.Series) → list[pandas.DataFrame][source]#

Return a list of dataframes with 4 columns: signal, trend, seasonality, residual.

sktime’s PolynomialTrendForecaster is used to generate the trend portion of the target data. statsmodels’ seasonal_decompose is used to generate the seasonality of the data.

Parameters
  • X (pd.DataFrame) – Input data with time series data in index.

  • y (pd.Series or pd.DataFrame) – Target variable data provided as a Series for univariate problems or a DataFrame for multivariate problems.

Returns

Each DataFrame contains the columns “signal”, “trend”, “seasonality” and “residual”, with the latter 3 column values being the decomposed elements of the target data. The “signal” column is simply the input target signal but reindexed with a datetime index to match the input features.

Return type

list of pd.DataFrame

Raises
  • TypeError – If X does not have time-series data in the index.

  • ValueError – If time series index of X does not have an inferred frequency.

  • ValueError – If the forecaster associated with the detrender has not been fit yet.

  • TypeError – If y is not provided as a pandas Series or DataFrame.

inverse_transform(self, y_t: pandas.Series) → tuple[pandas.DataFrame, pandas.Series][source]#

Adds back fitted trend and seasonality to target variable.

The polynomial trend is added back into the signal by calling the detrender’s inverse_transform(). Then, the seasonality is projected forward and added back into the signal.

Parameters

y_t (pd.Series) – Target variable.

Returns

The first element is the input features, returned without modification. The second element is the target variable y with the trend and seasonality added back in.

Return type

tuple of pd.DataFrame, pd.Series

Raises

ValueError – If y is None.

classmethod is_freq_valid(self, freq: str)#

Determines if the given string represents a valid frequency for this decomposer.

Parameters

freq (str) – A frequency to validate. See the pandas docs at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for options.

Returns

boolean representing whether the frequency is valid or not.

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

property parameters(self)#

Returns the parameters which were used to initialize the component.

plot_decomposition(self, X: pandas.DataFrame, y: pandas.Series, show: bool = False) → tuple[matplotlib.pyplot.Figure, list]#

Plots the decomposition of the target signal.

Parameters
  • X (pd.DataFrame) – Input data with time series data in index.

  • y (pd.Series or pd.DataFrame) – Target variable data provided as a Series for univariate problems or a DataFrame for multivariate problems.

  • show (bool) – Whether to display the plot or not. Defaults to False.

Returns

The figure and axes that have the decompositions plotted on them.

Return type

matplotlib.pyplot.Figure, list[matplotlib.pyplot.Axes]

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

set_seasonal_period(self, X: pandas.DataFrame, y: pandas.Series)#

Function to set the component’s seasonal period based on the target’s seasonality.

Parameters
  • X (pandas.DataFrame) – The feature data of the time series problem.

  • y (pandas.Series) – The target data of a time series problem.

transform(self, X: pandas.DataFrame, y: pandas.Series = None) → tuple[pandas.DataFrame, pandas.Series][source]#

Transforms the target data by removing the polynomial trend and rolling average seasonality.

Applies the fit polynomial detrender to the target data, removing the additive polynomial trend. Then, utilizes the first period’s worth of seasonal data determined in the .fit() function to extrapolate the seasonal signal of the data to be transformed. This seasonal signal is also assumed to be additive and is removed.
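The extrapolation step can be sketched as tiling a single period of seasonal values over the requested range and subtracting it. This is a minimal plain-Python illustration; the helper names and the start offset argument are assumptions, not EvalML's internals.

```python
def project_seasonality(one_period, start, length):
    """Repeat a single-period seasonal sample, beginning at offset start."""
    period = len(one_period)
    return [one_period[(start + i) % period] for i in range(length)]


def remove_seasonality(y, one_period, start=0):
    """Subtract the projected additive seasonal signal from y."""
    seasonal = project_seasonality(one_period, start, len(y))
    return [v - s for v, s in zip(y, seasonal)]


one_period = [1.0, 0.0, -1.0]
seasonal = project_seasonality(one_period, start=1, length=5)
print(seasonal)  # [0.0, -1.0, 1.0, 0.0, -1.0]

deseasonalized = remove_seasonality([11.0, 10.0, 9.0, 11.0], one_period)
print(deseasonalized)  # [10.0, 10.0, 10.0, 10.0]
```

The start offset matters when the data being transformed does not begin at the same point in the cycle as the data used for fitting.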

Parameters
  • X (pd.DataFrame, optional) – Conditionally used to build datetime index.

  • y (pd.Series) – Target variable to detrend and deseasonalize.

Returns

The input features are returned without modification. The target variable y is detrended and deseasonalized.

Return type

tuple of pd.DataFrame, pd.Series

Raises

ValueError – If the target data doesn’t have a DatetimeIndex and the features data contains no Datetime features.

class evalml.pipelines.components.transformers.preprocessing.ReplaceNullableTypes(random_seed=0, **kwargs)[source]#

Transformer to replace features with the new nullable dtypes with a dtype that is compatible in EvalML.

Attributes

hyperparameter_ranges

None

modifies_features

True

modifies_target

{}

name

Replace Nullable Types Transformer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Substitutes non-nullable types for the new pandas nullable types in the data and target data.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data by replacing columns that contain nullable types with the appropriate replacement type.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

fit_transform(self, X, y=None)[source]#

Substitutes non-nullable types for the new pandas nullable types in the data and target data.

Parameters
  • X (pd.DataFrame, optional) – Input features.

  • y (pd.Series) – Target data.

Returns

The input features and target data with the non-nullable types set.

Return type

tuple of pd.DataFrame, pd.Series

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Transforms data by replacing columns that contain nullable types with the appropriate replacement type.

Nullable integers are replaced with “float64” and nullable booleans with “category”.

Parameters
  • X (pd.DataFrame) – Data to transform

  • y (pd.Series, optional) – Target data to transform

Returns

Transformed X and transformed y.

Return type

tuple of pd.DataFrame, pd.Series

class evalml.pipelines.components.transformers.preprocessing.STLDecomposer(time_index: str = None, degree: int = 1, seasonal_period: int = 7, random_seed: int = 0, **kwargs)[source]#

Removes trends and seasonality from time series using the STL algorithm.

https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.STL.html

Parameters
  • time_index (str) – Specifies the name of the column in X that provides the datetime objects. Defaults to None.

  • degree (int) – Not currently used. STL has three “degree-like” values, none of which can be set at this time. Defaults to 1.

  • seasonal_period (int) – The number of entries in the time series data that corresponds to one period of a cyclic signal. For instance, if data is known to possess a weekly seasonal signal, and if the data is daily data, seasonal_period should be 7. For daily data with a yearly seasonal signal, seasonal_period should be 365. For compatibility with the underlying STL algorithm, must be odd. If an even number is provided, the next highest odd number will be used. Defaults to 7.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.
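The odd-period rule for seasonal_period can be sketched as a one-line adjustment. The helper name to_odd_period is hypothetical and only illustrates the documented behavior of rounding an even value up to the next odd number.

```python
def to_odd_period(seasonal_period: int) -> int:
    """Return seasonal_period unchanged if odd, else the next odd number."""
    if seasonal_period % 2 == 1:
        return seasonal_period
    return seasonal_period + 1


print(to_odd_period(7))    # 7
print(to_odd_period(12))   # 13
print(to_odd_period(365))  # 365
```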

Attributes

hyperparameter_ranges

None

invalid_frequencies

[‘SM’, ‘BM’, ‘SMS’, ‘BMS’, ‘BQ’, ‘BQS’, ‘T’, ‘S’, ‘L’, ‘U’, ‘N’, ‘A’, ‘BA’, ‘AS’, ‘BAS’, ‘BH’]

modifies_features

False

modifies_target

True

name

STL Decomposer

needs_fitting

True

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

determine_periodicity

Function that uses autocorrelative methods to determine the first significant period of the seasonal signal.

fit

Fits the STLDecomposer and determines the seasonal signal.

fit_transform

Removes fitted trend and seasonality from target variable.

get_trend_dataframe

Return a list of dataframes with 4 columns: signal, trend, seasonality, residual.

inverse_transform

Adds back fitted trend and seasonality to target variable.

is_freq_valid

Determines if the given string represents a valid frequency for this decomposer.

load

Loads component at file path.

parameters

Returns the parameters which were used to initialize the component.

plot_decomposition

Plots the decomposition of the target signal.

save

Saves component at file path.

set_seasonal_period

Function to set the component's seasonal period based on the target's seasonality.

transform

Transforms the target data by removing the STL trend and seasonality.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

determine_periodicity(self, X: pandas.DataFrame, y: pandas.Series, method: str = 'autocorrelation')#

Function that uses autocorrelative methods to determine the first significant period of the seasonal signal.

Parameters
  • X (pandas.DataFrame) – The feature data of the time series problem.

  • y (pandas.Series) – The target data of a time series problem.

  • method (str) – Either “autocorrelation” or “partial-autocorrelation”. The method by which to determine the first period of the seasonal part of the target signal. “partial-autocorrelation” should currently not be used. Defaults to “autocorrelation”.

Returns

The integer number of entries in the time series data over which the seasonal part of the target data repeats. If the time series data is in days, then this is the number of days that it takes the target’s seasonal signal to repeat. Note: the target data can contain multiple seasonal signals. This function will only return the first, and thus shortest, period. E.g. if the target has both weekly and yearly seasonality, the function will only return “7” and not return “365”. If no period is detected, returns [None].

Return type

(list[int])

fit(self, X: pandas.DataFrame, y: pandas.Series = None) → STLDecomposer[source]#

Fits the STLDecomposer and determines the seasonal signal.

Instantiates a statsmodels STL decompose object with the component’s stored parameters and fits it. Since the statsmodels object does not fit the sklearn API, it is not saved during __init__() in _component_obj and will be re-instantiated each time fit is called.

To emulate the sklearn API, when the STL decomposer is fit, the full seasonal component, a single period sample of the seasonal component, the full trend-cycle component and the residual are saved.

y(t) = S(t) + T(t) + R(t)

Parameters
  • X (pd.DataFrame, optional) – Conditionally used to build datetime index.

  • y (pd.Series) – Target variable to detrend and deseasonalize.

Returns

self

Raises
  • ValueError – If y is None.

  • ValueError – If the target data doesn’t have a DatetimeIndex and the features data contains no Datetime features.

fit_transform(self, X: pandas.DataFrame, y: pandas.Series = None) → tuple[pandas.DataFrame, pandas.Series]#

Removes fitted trend and seasonality from target variable.

Parameters
  • X (pd.DataFrame, optional) – Ignored.

  • y (pd.Series) – Target variable to detrend and deseasonalize.

Returns

The first element is the input features, returned without modification. The second element is the target variable y with the fitted trend removed.

Return type

tuple of pd.DataFrame, pd.Series

get_trend_dataframe(self, X, y)[source]#

Return a list of dataframes with 4 columns: signal, trend, seasonality, residual.

Parameters
  • X (pd.DataFrame) – Input data with time series data in index.

  • y (pd.Series or pd.DataFrame) – Target variable data provided as a Series for univariate problems or a DataFrame for multivariate problems.

Returns

Each DataFrame contains the columns “signal”, “trend”, “seasonality” and “residual”, with the latter 3 column values being the decomposed elements of the target data. The “signal” column is simply the input target signal but reindexed with a datetime index to match the input features.

Return type

list of pd.DataFrame

Raises
  • TypeError – If X does not have time-series data in the index.

  • ValueError – If time series index of X does not have an inferred frequency.

  • ValueError – If the forecaster associated with the detrender has not been fit yet.

  • TypeError – If y is not provided as a pandas Series or DataFrame.

inverse_transform(self, y_t: pandas.Series) tuple[pandas.DataFrame, pandas.Series][source]#

Adds back fitted trend and seasonality to target variable.

The STL trend is projected to cover the entire requested target range, then added back into the signal. Then, the seasonality is projected forward and added back into the signal.

Parameters

y_t (pd.Series) – Target variable.

Returns

The first element is the input features returned without modification.

The second element is the target variable y with the trend and seasonality added back in.

Return type

tuple of pd.DataFrame, pd.Series

Raises

ValueError – If y is None.

classmethod is_freq_valid(self, freq: str)#

Determines if the given string represents a valid frequency for this decomposer.

Parameters

freq (str) – A frequency to validate. See the pandas docs at https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for options.

Returns

boolean representing whether the frequency is valid or not.

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

property parameters(self)#

Returns the parameters which were used to initialize the component.

plot_decomposition(self, X: pandas.DataFrame, y: pandas.Series, show: bool = False) tuple[matplotlib.pyplot.Figure, list]#

Plots the decomposition of the target signal.

Parameters
  • X (pd.DataFrame) – Input data with time series data in index.

  • y (pd.Series or pd.DataFrame) – Target variable data provided as a Series for univariate problems or a DataFrame for multivariate problems.

  • show (bool) – Whether to display the plot or not. Defaults to False.

Returns

The figure and axes that have the decompositions plotted on them.

Return type

matplotlib.pyplot.Figure, list[matplotlib.pyplot.Axes]

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

set_seasonal_period(self, X: pandas.DataFrame, y: pandas.Series)#

Function to set the component’s seasonal period based on the target’s seasonality.

Parameters
  • X (pandas.DataFrame) – The feature data of the time series problem.

  • y (pandas.Series) – The target data of a time series problem.

transform(self, X: pandas.DataFrame, y: pandas.Series = None) tuple[pandas.DataFrame, pandas.Series][source]#

Transforms the target data by removing the STL trend and seasonality.

Uses an ARIMA model to project forward the additive trend and removes it. Then, utilizes the first period’s worth of seasonal data determined in the .fit() function to extrapolate the seasonal signal of the data to be transformed. This seasonal signal is also assumed to be additive and is removed.

Parameters
  • X (pd.DataFrame, optional) – Conditionally used to build datetime index.

  • y (pd.Series) – Target variable to detrend and deseasonalize.

Returns

The input features are returned without modification. The target variable y is detrended and deseasonalized.

Return type

tuple of pd.DataFrame, pd.Series

Raises

ValueError – If target data doesn’t have DatetimeIndex AND no Datetime features in features data
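The two projection steps described for transform can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation: a straight-line extension of the last trend step stands in for the ARIMA forecast, and the helper names `project_trend` and `project_seasonal`, as well as the fitted values, are hypothetical.

```python
def project_seasonal(sample, start, length):
    """Repeat a single-period seasonal sample over positions start..start+length-1."""
    period = len(sample)
    return [sample[(start + i) % period] for i in range(length)]


def project_trend(trend, length):
    """Extend a fitted trend linearly using its last step (stand-in for a forecast model)."""
    step = trend[-1] - trend[-2]
    return [trend[-1] + step * (i + 1) for i in range(length)]


# Hypothetical fitted state: a trend over 6 training points and a 3-step seasonal sample.
fitted_trend = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5]
seasonal_sample = [2.0, -1.0, -1.0]

# New target values that directly follow the training data (positions 6..9).
y_new = [15.0, 12.5, 13.0, 16.5]
trend_new = project_trend(fitted_trend, len(y_new))
seasonal_new = project_seasonal(seasonal_sample, start=len(fitted_trend), length=len(y_new))

# Removing both additive components leaves the residual signal.
y_t = [y - t - s for y, t, s in zip(y_new, trend_new, seasonal_new)]
# Adding the same projections back recovers the original values.
y_back = [v + t + s for v, t, s in zip(y_t, trend_new, seasonal_new)]
```

Aligning the seasonal sample by position modulo the period is what lets a single stored period cover any later stretch of data.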

class evalml.pipelines.components.transformers.preprocessing.TextTransformer(component_obj=None, random_seed=0, **kwargs)[source]#

Base class for all transformers working with text features.

Parameters
  • component_obj (obj) – Third-party objects useful in component implementation. Defaults to None.

  • random_seed (int) – Seed for the random number generator. Defaults to 0.

Attributes

modifies_features

True

modifies_target

False

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

name

Returns string name of this component.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data X.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict
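The convention above can be sketched with a stand-in class. `SketchComponent` is hypothetical and not an EvalML class; here default_parameters is written as a plain classmethod (so it is called with parentheses), which may differ from how the library exposes it.

```python
class SketchComponent:
    """Hypothetical component used only to illustrate the convention."""

    def __init__(self, max_delay=2, gap=0, random_seed=0):
        # Parameters are recorded exactly as they were passed at construction.
        self.parameters = {"max_delay": max_delay, "gap": gap}
        self.random_seed = random_seed

    @classmethod
    def default_parameters(cls):
        # Defaults are whatever a no-argument instance ends up storing.
        return cls().parameters

    def clone(self):
        # A clone is a fresh instance built from the same parameters and seed.
        return type(self)(**self.parameters, random_seed=self.random_seed)


custom = SketchComponent(max_delay=5, random_seed=7)
copy = custom.clone()
```

Deriving the defaults from a default-constructed instance keeps the two from drifting apart when a constructor signature changes.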

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)#

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features]

  • y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

property name(cls)#

Returns string name of this component.

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

abstract transform(self, X, y=None)#

Transforms data X.

Parameters
  • X (pd.DataFrame) – Data to transform.

  • y (pd.Series, optional) – Target data.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.
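Because transform is abstract on this base class, a concrete subclass has to supply it. The sketch below shows that pattern with Python's abc module. `SketchTransformer` and `LowercaseText` are hypothetical names, plain lists stand in for DataFrames, and the assumption that fit_transform is simply fit followed by transform is made for illustration.

```python
from abc import ABC, abstractmethod


class SketchTransformer(ABC):
    """Hypothetical base class: fit is a no-op, transform must be overridden."""

    def fit(self, X, y=None):
        return self

    @abstractmethod
    def transform(self, X, y=None):
        """Return the transformed X."""

    def fit_transform(self, X, y=None):
        # Assumed behaviour: fit on X, then transform the same X.
        return self.fit(X, y).transform(X, y)


class LowercaseText(SketchTransformer):
    """Toy text transformer that lowercases every cell."""

    def transform(self, X, y=None):
        return [[cell.lower() for cell in row] for row in X]


result = LowercaseText().fit_transform([["Hello World", "EvalML"]])
```

Trying to instantiate `SketchTransformer` directly raises a TypeError, which is how the abstract marker is enforced at runtime.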

class evalml.pipelines.components.transformers.preprocessing.TimeSeriesFeaturizer(time_index=None, max_delay=2, gap=0, forecast_horizon=1, conf_level=0.05, rolling_window_size=0.25, delay_features=True, delay_target=True, random_seed=0, **kwargs)[source]#

Transformer that delays input features and target variable for time series problems.

This component uses an algorithm based on the autocorrelation values of the target variable to determine which lags to select from the set of all possible lags.

The algorithm is based on the idea that the local maxima of the autocorrelation function indicate the lags that have the most impact on the present time.

The algorithm computes the autocorrelation values and finds the local maxima, called “peaks”, that are significant at the given conf_level. Since lags in the range [0, 10] tend to be predictive but not local maxima, the union of the peaks is taken with the significant lags in the range [0, 10]. At the end, only selected lags in the range [0, max_delay] are used.

Parametrizing the algorithm by conf_level lets the AutoMLAlgorithm tune the set of lags chosen so that the chances of finding a good set of lags are higher.

Using conf_level value of 1 selects all possible lags.
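The lag-selection steps described above can be sketched in plain Python. This is an approximation under stated assumptions, not the component's code: significance is tested against a simple white-noise band of ± z / sqrt(n), peaks are strict local maxima, and the function names `autocorrelation` and `select_lags` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def autocorrelation(y):
    """Sample autocorrelation of y for lags 0..len(y)-1."""
    n = len(y)
    mu = mean(y)
    centered = [v - mu for v in y]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / denom
        for k in range(n)
    ]


def select_lags(y, max_delay, conf_level=0.05):
    """Pick lags in [1, max_delay] from significant peaks and significant early lags."""
    if conf_level == 1:
        # A conf_level of 1 selects every possible lag.
        return list(range(1, max_delay + 1))
    acf = autocorrelation(y)
    # Assumed significance test: white-noise band of +/- z / sqrt(n).
    band = NormalDist().inv_cdf(1 - conf_level / 2) / sqrt(len(y))
    significant = {k for k, r in enumerate(acf) if abs(r) > band}
    # Peaks are strict local maxima of the autocorrelation function.
    peaks = {k for k in range(1, len(acf) - 1) if acf[k - 1] < acf[k] > acf[k + 1]}
    # Union of significant peaks with the significant lags in [0, 10]; lag 1 is always kept.
    chosen = (peaks & significant) | {k for k in significant if k <= 10} | {1}
    return sorted(k for k in chosen if 1 <= k <= max_delay)


# Toy target with a period of 4 time steps.
y = [0.0, 1.0, 0.0, -1.0] * 10
lags = select_lags(y, max_delay=8, conf_level=0.05)
```

For this toy series the odd lags have zero autocorrelation, so apart from the always-included lag 1 only even lags survive the significance filter.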

Parameters
  • time_index (str) – Name of the column containing the datetime information used to order the data. Ignored.

  • max_delay (int) – Maximum number of time units to delay each feature. Defaults to 2.

  • forecast_horizon (int) – The number of time periods the pipeline is expected to forecast.

  • conf_level (float) – Float in range (0, 1] that determines the confidence interval size used to select which lags to compute from the set of [1, max_delay]. A delay of 1 will always be computed. If 1, selects all possible lags in the set of [1, max_delay], inclusive.

  • rolling_window_size (float) – Float in range (0, 1] that determines the size of the window used for rolling features. Size is computed as rolling_window_size * max_delay.

  • delay_features (bool) – Whether to delay the input features. Defaults to True.

  • delay_target (bool) – Whether to delay the target. Defaults to True.

  • gap (int) – The number of time units between when the features are collected and when the target is collected. For example, if you are predicting the next time step’s target, gap=1. This is only needed because when gap=0, we need to be sure to start the lagging of the target variable at 1. Defaults to 0.

  • random_seed (int) – Seed for the random number generator. This transformer performs the same regardless of the random seed provided.

Attributes

hyperparameter_ranges

{“conf_level”: Real(0.001, 1.0), “rolling_window_size”: Real(0.001, 1.0)}

modifies_features

True

modifies_target

False

name

Time Series Featurizer

needs_fitting

True

target_colname_prefix

target_delay_{}

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits the TimeSeriesFeaturizer.

fit_transform

Fit the component and transform the input data.

load

Loads component at file path.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Computes the delayed values and rolling means for X and y.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the TimeSeriesFeaturizer.

Parameters
  • X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features]

  • y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

ValueError – if self.time_index is None

fit_transform(self, X, y=None)[source]#

Fit the component and transform the input data.

Parameters
  • X (pd.DataFrame) – Data to transform.

  • y (pd.Series, or None) – Target.

Returns

Transformed X.

Return type

pd.DataFrame

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Computes the delayed values and rolling means for X and y.

The chosen delays are determined by the autocorrelation function of the target variable. See the class docstring for more information on how they are chosen. If y is None, all possible lags are chosen.

If y is not None, it will also compute the delayed values for the target variable.

The rolling means for all numeric features in X and y, if y is numeric, are also returned.

Parameters
  • X (pd.DataFrame or None) – Data to transform. None is expected when only the target variable is being used.

  • y (pd.Series, or None) – Target.

Returns

Transformed X. No original features are returned.

Return type

pd.DataFrame
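A minimal sketch of delayed values and a rolling mean for a target series, using plain lists in place of pandas objects. The helper names, the "target_rolling_mean" column name, and the way the window size is rounded are assumptions. Only the "target_delay_{}" prefix and the rolling_window_size * max_delay rule come from the documentation above. The sketch also ignores gap and forecast_horizon.

```python
from statistics import mean


def delay(values, lag):
    """Shift values forward by `lag` steps, padding the start with None."""
    return [None] * lag + list(values[: len(values) - lag])


def rolling_mean(values, window):
    """Mean of each trailing window; None until a full window of values exists."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        full = len(chunk) == window and None not in chunk
        out.append(mean(chunk) if full else None)
    return out


def featurize_target(y, lags, max_delay, rolling_window_size):
    """Build delayed copies of y plus one rolling-mean column."""
    features = {"target_delay_{}".format(lag): delay(y, lag) for lag in lags}
    # Window size follows rolling_window_size * max_delay; the rounding is assumed.
    window = max(1, int(rolling_window_size * max_delay))
    # Hypothetical column name; the mean is taken over the lag-1 target so that
    # no row sees its own target value.
    features["target_rolling_mean"] = rolling_mean(delay(y, 1), window)
    return features


y = [1.0, 2.0, 3.0, 4.0, 5.0]
features = featurize_target(y, lags=[1, 2], max_delay=4, rolling_window_size=0.5)
```

The leading None entries mark rows where the required history does not exist yet.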

class evalml.pipelines.components.transformers.preprocessing.TimeSeriesRegularizer(time_index=None, frequency_payload=None, window_length=4, threshold=0.4, random_seed=0, **kwargs)[source]#

Transformer that regularizes an inconsistently spaced datetime column.

If X is passed in to fit/transform, the column time_index will be checked for an inferrable offset frequency. If the time_index column is perfectly inferrable then this Transformer will do nothing and return the original X and y.

If X does not have a perfectly inferrable frequency but one can be estimated, then X and y will be reformatted based on the estimated frequency for time_index. In the original X and y passed:
  • Missing datetime values will be added and will have their corresponding columns in X and y set to None.

  • Duplicate datetime values will be dropped.

  • Extra datetime values will be dropped.

  • If it can be determined that a duplicate or extra value is misaligned, then it will be repositioned to take the place of a missing value.

This Transformer should be used before the TimeSeriesImputer in order to impute the missing values that were added to X and y (if passed).
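The regularization described above (adding missing timestamps, dropping duplicates and extras) can be sketched with the standard library. This is a simplified illustration, not the component's logic: the frequency is estimated as the most common positive gap rather than with Woodwork, timestamps are assumed to be sorted, the repositioning of misaligned values is omitted, and the function name `regularize` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def regularize(timestamps, values):
    """Rebuild (timestamps, values) on an evenly spaced grid."""
    # Estimate the frequency as the most common positive gap (assumed heuristic).
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:]) if b > a]
    step = Counter(gaps).most_common(1)[0][0]

    # Keep the first value seen for each timestamp, which drops duplicates.
    seen = {}
    for ts, value in zip(timestamps, values):
        seen.setdefault(ts, value)

    # Walk the clean grid: off-grid extras are never visited, gaps become None.
    grid, out = [], []
    current, last = min(timestamps), max(timestamps)
    while current <= last:
        grid.append(current)
        out.append(seen.get(current))
        current += step
    return grid, out


def day(d):
    return datetime(2023, 1, d)


# January 2 is duplicated and January 3 is missing.
ts = [day(1), day(2), day(2), day(4), day(5)]
clean_ts, clean_values = regularize(ts, [10, 20, 21, 40, 50])
```

The None left at January 3 is the kind of gap a downstream imputer is expected to fill.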

Parameters
  • time_index (string) – Name of the column containing the datetime information used to order the data, required. Defaults to None.

  • frequency_payload (tuple) – Payload returned from Woodwork’s infer_frequency function where debug is True. Defaults to None.

  • window_length (int) – The size of the rolling window over which inference is conducted to determine the prevalence of uninferrable frequencies. Lower values make this component more sensitive to recognizing numerous faulty datetime values. Defaults to 4.

  • threshold (float) – The minimum percentage of windows that need to have been able to infer a frequency. Lower values make this component more sensitive to recognizing numerous faulty datetime values. Defaults to 0.4.

  • random_seed (int) – Seed for the random number generator. This transformer performs the same regardless of the random seed provided. Defaults to 0.

Raises

ValueError – if the frequency_payload parameter has not been passed a tuple

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

True

name

Time Series Regularizer

training_only

True

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits the TimeSeriesRegularizer.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Regularizes a dataframe and target data to an inferrable offset frequency.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)[source]#

Fits the TimeSeriesRegularizer.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self

Raises
  • ValueError – if self.time_index is None, if X and y have different lengths, if time_index in X does not have an offset frequency that can be estimated

  • TypeError – if the time_index column is not of type Datetime

  • KeyError – if the time_index column doesn’t exist

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)[source]#

Regularizes a dataframe and target data to an inferrable offset frequency.

A ‘clean’ X and y (if y was passed in) are created based on an inferrable offset frequency and matching datetime values with the original X and y are imputed into the clean X and y. Datetime values identified as misaligned are shifted into their appropriate position.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features].

  • y (pd.Series, optional) – The target training data of length [n_samples].

Returns

Data with an inferrable time_index offset frequency.

Return type

(pd.DataFrame, pd.Series)

class evalml.pipelines.components.transformers.preprocessing.URLFeaturizer(random_seed=0, **kwargs)[source]#

Transformer that can automatically extract features from URL.

Parameters

random_seed (int) – Seed for the random number generator. Defaults to 0.
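To give a sense of what features can be derived from a URL, here is a rough standard-library sketch. The feature names ("protocol", "domain", "tld"), the function name `url_features`, and the naive top-level-domain rule are assumptions for illustration. The features this component actually produces may differ.

```python
from urllib.parse import urlparse


def url_features(url):
    """Extract a few simple features from a URL string."""
    parsed = urlparse(url)
    domain = parsed.hostname or ""
    # Naive TLD: the last dot-separated label of the host name.
    tld = domain.rsplit(".", 1)[-1] if "." in domain else ""
    return {"protocol": parsed.scheme, "domain": domain, "tld": tld}


features = url_features("https://www.example.com/path?q=1")
```

Taking the last label as the top-level domain is a simplification that does not handle multi-part suffixes such as "co.uk".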

Attributes

hyperparameter_ranges

{}

modifies_features

True

modifies_target

False

name

URL Featurizer

training_only

False

Methods

clone

Constructs a new component with the same parameters and random state.

default_parameters

Returns the default parameters for this component.

describe

Describe a component and its parameters.

fit

Fits component to data.

fit_transform

Fits on X and transforms X.

load

Loads component at file path.

needs_fitting

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

parameters

Returns the parameters which were used to initialize the component.

save

Saves component at file path.

transform

Transforms data X.

clone(self)#

Constructs a new component with the same parameters and random state.

Returns

A new instance of this component with identical parameters and random state.

default_parameters(cls)#

Returns the default parameters for this component.

Our convention is that Component.default_parameters == Component().parameters.

Returns

Default parameters for this component.

Return type

dict

describe(self, print_name=False, return_dict=False)#

Describe a component and its parameters.

Parameters
  • print_name (bool, optional) – whether to print name of component

  • return_dict (bool, optional) – whether to return description as dictionary in the format {“name”: name, “parameters”: parameters}

Returns

Returns dictionary if return_dict is True, else None.

Return type

None or dict

fit(self, X, y=None)#

Fits component to data.

Parameters
  • X (pd.DataFrame) – The input training data of shape [n_samples, n_features]

  • y (pd.Series, optional) – The target training data of length [n_samples]

Returns

self

Raises

MethodPropertyNotFoundError – If component does not have a fit method or a component_obj that implements fit.

fit_transform(self, X, y=None)#

Fits on X and transforms X.

Parameters
  • X (pd.DataFrame) – Data to fit and transform.

  • y (pd.Series) – Target data.

Returns

Transformed X.

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.

static load(file_path)#

Loads component at file path.

Parameters

file_path (str) – Location to load file.

Returns

ComponentBase object

needs_fitting(self)#

Returns boolean determining if component needs fitting before calling predict, predict_proba, transform, or feature_importances.

This can be overridden to False for components that do not need to be fit or whose fit methods do nothing.

Returns

True.

property parameters(self)#

Returns the parameters which were used to initialize the component.

save(self, file_path, pickle_protocol=cloudpickle.DEFAULT_PROTOCOL)#

Saves component at file path.

Parameters
  • file_path (str) – Location to save file.

  • pickle_protocol (int) – The pickle data stream format.

transform(self, X, y=None)#

Transforms data X.

Parameters
  • X (pd.DataFrame) – Data to transform.

  • y (pd.Series, optional) – Target data.

Returns

Transformed X

Return type

pd.DataFrame

Raises

MethodPropertyNotFoundError – If transformer does not have a transform method or a component_obj that implements transform.