gen_utils#

General utility methods.

Module Contents#

Classes Summary#

classproperty

Allows function to be accessed as a class level property.

Functions#

are_datasets_separated_by_gap_time_index

Determine if the train and test datasets are separated by gap number of units using the time_index.

are_ts_parameters_valid_for_split

Validates the time series parameters in problem_configuration are compatible with split sizes.

contains_all_ts_parameters

Validates that the problem configuration contains all required keys.

convert_to_seconds

Converts a string describing a length of time to its length in seconds.

deprecate_arg

Helper to raise warnings when a deprecated arg is used.

drop_rows_with_nans

Drop rows that contain any NaNs from all of the given dataframes or series.

get_importable_subclasses

Get importable subclasses of a base class. Used to list all of our estimators, transformers, components and pipelines dynamically.

get_random_seed

Given a numpy.random.RandomState object, generate an int representing a seed value for another random number generator. Or, if given an int, return that int.

get_random_state

Generates a numpy.random.RandomState instance using seed.

get_time_index

Determines the column in the given data that should be used as the time index.

import_or_raise

Attempts to import the requested library by name. If the import fails, raises an ImportError or warning.

is_all_numeric

Checks if the given DataFrame contains only numeric values.

is_categorical_actually_boolean

Function to identify columns of a dataframe that contain True, False and null type.

jupyter_check

Get whether or not the code is being run in an IPython environment (such as Jupyter Notebook or Jupyter Lab).

pad_with_nans

Pad the beginning num_to_pad rows with nans.

safe_repr

Convert the given value into a string that can safely be used for repr.

save_plot

Saves fig to filepath if specified, or to a default location if not.

validate_holdout_datasets

Validate the holdout datasets match our expectations.

Attributes Summary#

logger

SEED_BOUNDS

Contents#

evalml.utils.gen_utils.are_datasets_separated_by_gap_time_index(train, test, pipeline_params)[source]#

Determine if the train and test datasets are separated by gap number of units using the time_index.

This will be true when users are predicting on unseen data but not during cross validation since the target is known.

Parameters
  • train (pd.DataFrame) – Training data.

  • test (pd.DataFrame) – Data of shape [n_samples, n_features].

  • pipeline_params (dict) – Dictionary of time series parameters.

Returns

True if the difference in time units is equal to gap + 1.

Return type

bool
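The "gap + 1" rule can be illustrated with plain dates. This is a minimal sketch, not the library's implementation: the helper `is_separated_by_gap` and its `step` argument are hypothetical, and it assumes the time index is regularly spaced.

```python
from datetime import date, timedelta


def is_separated_by_gap(last_train, first_test, step, gap):
    """Return True if first_test is exactly gap + 1 steps after last_train."""
    # Assumes both dates sit on the same regular grid with spacing `step`.
    units_apart = (first_test - last_train) / step
    return units_apart == gap + 1


daily = timedelta(days=1)
last_train_day = date(2024, 1, 10)

# gap=0: the test set starts on the very next day.
adjacent = is_separated_by_gap(last_train_day, date(2024, 1, 11), daily, gap=0)
# gap=2: two days are skipped, so the test set starts three days later.
gapped = is_separated_by_gap(last_train_day, date(2024, 1, 13), daily, gap=2)
# A test set that starts only two days later does not satisfy gap=2.
too_early = is_separated_by_gap(last_train_day, date(2024, 1, 12), daily, gap=2)
```

With gap=0 the first test row is one unit after the last training row, which is why the check compares against gap + 1 rather than gap.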

evalml.utils.gen_utils.are_ts_parameters_valid_for_split(gap, max_delay, forecast_horizon, n_obs, n_splits)[source]#

Validates the time series parameters in problem_configuration are compatible with split sizes.

Parameters
  • gap (int) – gap value.

  • max_delay (int) – max_delay value.

  • forecast_horizon (int) – forecast_horizon value.

  • n_obs (int) – Number of observations in the dataset.

  • n_splits (int) – Number of cross validation splits.

Returns

TsParameterValidationResult, a named tuple with four fields:
  • is_valid (bool) – True if parameters are valid.

  • msg (str) – Contains error message to display. Empty if is_valid.

  • smallest_split_size (int) – Smallest split size given n_obs and n_splits.

  • max_window_size (int) – Max window size given gap, max_delay, forecast_horizon.

class evalml.utils.gen_utils.classproperty(func)[source]#

Allows function to be accessed as a class level property.

Example:

class LogisticRegressionBinaryPipeline(PipelineBase):
    component_graph = ['Simple Imputer', 'Logistic Regression Classifier']

    @classproperty
    def summary(cls):
        summary = ""
        for component in cls.component_graph:
            component = handle_component_class(component)
            summary += component.name + " + "
        return summary

assert LogisticRegressionBinaryPipeline.summary == "Simple Imputer + Logistic Regression Classifier + "
assert LogisticRegressionBinaryPipeline().summary == "Simple Imputer + Logistic Regression Classifier + "
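
To show how a class-level property can work without any evalml imports, here is a self-contained sketch built on Python's descriptor protocol. It is an illustration under assumptions, not the library's source: the `classproperty` defined here and the `Pipeline` class are stand-ins written for this example.

```python
class classproperty:
    """Minimal descriptor that exposes a function as a class-level property."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # `owner` is the class for both class access and instance access,
        # so the wrapped function always receives the class.
        return self.func(owner)


class Pipeline:
    # Hypothetical stand-in for a pipeline class.
    component_graph = ["Simple Imputer", "Logistic Regression Classifier"]

    @classproperty
    def summary(cls):
        return " + ".join(cls.component_graph)


class_level = Pipeline.summary
instance_level = Pipeline().summary
```

Both lookups return the same string because the descriptor ignores the instance and calls the function with the class.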
evalml.utils.gen_utils.contains_all_ts_parameters(problem_configuration)[source]#

Validates that the problem configuration contains all required keys.

Parameters

problem_configuration (dict) – Problem configuration.

Returns

True if the configuration contains all parameters. If False, msg is a non-empty string with error message.

Return type

bool, str

evalml.utils.gen_utils.convert_to_seconds(input_str)[source]#

Converts a string describing a length of time to its length in seconds.

Parameters

input_str (str) – The string to be parsed and converted to seconds.

Returns

The length of time in seconds, as a float.

Raises

AssertionError – If an invalid unit is used.

Examples

>>> assert convert_to_seconds("10 hr") == 36000.0
>>> assert convert_to_seconds("30 minutes") == 1800.0
>>> assert convert_to_seconds("2.5 min") == 150.0
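
A parser with this behavior could be sketched as follows. This is a hypothetical re-implementation for illustration only: the function name `to_seconds` and the unit table are assumptions, and the set of unit aliases the library actually accepts may differ.

```python
# Hypothetical unit table; the aliases accepted by the library may differ.
SECONDS_PER_UNIT = {
    "s": 1, "sec": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hr": 3600, "hour": 3600, "hours": 3600,
}


def to_seconds(input_str):
    """Parse '<number> <unit>' and return the duration in seconds as a float."""
    value, unit = input_str.split()
    if unit not in SECONDS_PER_UNIT:
        # Mirrors the documented behavior of raising AssertionError on a bad unit.
        raise AssertionError(f"Invalid unit: {unit!r}")
    return float(value) * SECONDS_PER_UNIT[unit]


ten_hours = to_seconds("10 hr")
half_hour = to_seconds("30 minutes")
fractional = to_seconds("2.5 min")
```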
evalml.utils.gen_utils.deprecate_arg(old_arg, new_arg, old_value, new_value)[source]#

Helper to raise warnings when a deprecated arg is used.

Parameters
  • old_arg (str) – Name of old/deprecated argument.

  • new_arg (str) – Name of new argument.

  • old_value (Any) – Value the user passed in for the old argument.

  • new_value (Any) – Value the user passed in for the new argument.

Returns

old_value if not None, else new_value
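
The selection rule above can be sketched with the standard `warnings` module. The helper `pick_arg`, the warning category, the message text, and the argument names used in the calls are all assumptions made for this example rather than details taken from the library.

```python
import warnings


def pick_arg(old_arg, new_arg, old_value, new_value):
    """Return old_value (with a warning) if it was passed, else new_value."""
    if old_value is not None:
        # Warning category and wording are assumptions for this sketch.
        warnings.warn(
            f"Argument '{old_arg}' is deprecated in favor of '{new_arg}'.",
            DeprecationWarning,
        )
        return old_value
    return new_value


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Hypothetical argument names, chosen only for illustration.
    from_old = pick_arg("random_state", "random_seed", 7, 0)
    from_new = pick_arg("random_state", "random_seed", None, 0)
```

Only the first call warns, because only it supplies a value for the deprecated argument.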

evalml.utils.gen_utils.drop_rows_with_nans(*pd_data)[source]#

Drop rows that contain any NaNs from all of the given dataframes or series.

Parameters

*pd_data – sequence of pd.Series or pd.DataFrame or None

Returns

list of pd.DataFrame or pd.Series or None

evalml.utils.gen_utils.get_importable_subclasses(base_class, used_in_automl=True)[source]#

Get importable subclasses of a base class. Used to list all of our estimators, transformers, components and pipelines dynamically.

Parameters
  • base_class (abc.ABCMeta) – Base class to find all of the subclasses for.

  • used_in_automl (bool) – Not all components/pipelines/estimators are used in automl search. If True, only include those subclasses that are used in the search. This would mean excluding classes related to ExtraTrees, ElasticNet, and Baseline estimators. Defaults to True.

Returns

List of subclasses.

evalml.utils.gen_utils.get_random_seed(random_state, min_bound=SEED_BOUNDS.min_bound, max_bound=SEED_BOUNDS.max_bound)[source]#

Given a numpy.random.RandomState object, generate an int representing a seed value for another random number generator. Or, if given an int, return that int.

To protect against invalid input to a particular library’s random number generator, if an int value is provided, and it is outside the bounds “[min_bound, max_bound)”, the value will be projected into the range between the min_bound (inclusive) and max_bound (exclusive) using modular arithmetic.

Parameters
  • random_state (int, numpy.random.RandomState) – random state

  • min_bound (None, int) – if not default of None, will be min bound when generating seed (inclusive). Must be less than max_bound.

  • max_bound (None, int) – if not default of None, will be max bound when generating seed (exclusive). Must be greater than min_bound.

Returns

Seed for random number generator

Return type

int

Raises

ValueError – If boundaries are not valid.
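
The projection of an out-of-range int into [min_bound, max_bound) with modular arithmetic can be sketched as below. The helper `project_seed` is hypothetical and covers only the int case described above; it is one formula that satisfies the description, not necessarily the exact expression the library uses.

```python
def project_seed(value, min_bound, max_bound):
    """Map an int into [min_bound, max_bound) using modular arithmetic."""
    if not min_bound < max_bound:
        raise ValueError("min_bound must be less than max_bound")
    if min_bound <= value < max_bound:
        # Values already in range are returned unchanged.
        return value
    # Shift so the range starts at zero, wrap, then shift back.
    return min_bound + (value - min_bound) % (max_bound - min_bound)


in_range = project_seed(5, 0, 10)
wrapped_high = project_seed(12, 0, 10)
wrapped_low = project_seed(-1, 0, 10)
shifted = project_seed(25, 10, 20)
```

Python's `%` returns a non-negative result for a positive divisor, so values below min_bound also land inside the range.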

evalml.utils.gen_utils.get_random_state(seed)[source]#

Generates a numpy.random.RandomState instance using seed.

Parameters

seed (None, int, np.random.RandomState object) – seed to use to generate numpy.random.RandomState. Must be between SEED_BOUNDS.min_bound and SEED_BOUNDS.max_bound, inclusive.

Raises

ValueError – If the input seed is not within the acceptable range.

Returns

A numpy.random.RandomState instance.

evalml.utils.gen_utils.get_time_index(X: pandas.DataFrame, y: pandas.Series, time_index_name: str)[source]#

Determines the column in the given data that should be used as the time index.

evalml.utils.gen_utils.import_or_raise(library, error_msg=None, warning=False)[source]#

Attempts to import the requested library by name. If the import fails, raises an ImportError or warning.

Parameters
  • library (str) – The name of the library.

  • error_msg (str) – Error message to return if the import fails.

  • warning (bool) – If True, import_or_raise gives a warning instead of ImportError. Defaults to False.

Returns

Returns the library if importing succeeded.

Raises
  • ImportError – If attempting to import the library fails because the library is not installed.

  • Exception – If importing the library fails.
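
The import-or-warn pattern described here can be sketched with `importlib`. The function `import_or_warn` is a hypothetical stand-in: its default message and its choice to return None after warning are assumptions, not documented library behavior.

```python
import importlib
import warnings


def import_or_warn(library, error_msg=None, warning=False):
    """Import a library by name; on failure either warn or raise ImportError."""
    try:
        return importlib.import_module(library)
    except ImportError:
        # Default message is an assumption for this sketch.
        message = error_msg or f"Missing optional dependency '{library}'."
        if warning:
            warnings.warn(message)
            return None
        raise ImportError(message)


json_module = import_or_warn("json")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # A deliberately nonexistent module name.
    missing = import_or_warn("not_a_real_library_xyz", warning=True)
```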

evalml.utils.gen_utils.is_all_numeric(df)[source]#

Checks if the given DataFrame contains only numeric values.

Parameters

df (pd.DataFrame) – The DataFrame to check data types of.

Returns

True if all the columns are numeric and are not missing any values, False otherwise.

evalml.utils.gen_utils.is_categorical_actually_boolean(df, df_col)[source]#

Function to identify columns of a dataframe that contain True, False and null type.

The function is intended to be applied to columns that are identified as Categorical by the Imputer/SimpleImputer.

Parameters
  • df (pandas.DataFrame) – Pandas dataframe with data.

  • df_col (str) – Name of the column to check for being effectively a nullable Boolean.

Returns

Whether the column contains True, False and a null type.

Return type

bool

evalml.utils.gen_utils.jupyter_check()[source]#

Get whether or not the code is being run in an IPython environment (such as Jupyter Notebook or Jupyter Lab).

Returns

True if IPython, False otherwise.

Return type

bool

evalml.utils.gen_utils.logger#
evalml.utils.gen_utils.pad_with_nans(pd_data, num_to_pad)[source]#

Pad the beginning num_to_pad rows with nans.

Parameters
  • pd_data (pd.DataFrame or pd.Series) – Data to pad.

  • num_to_pad (int) – Number of nans to pad.

Returns

pd.DataFrame or pd.Series

evalml.utils.gen_utils.safe_repr(value)[source]#

Convert the given value into a string that can safely be used for repr.

Parameters

value – The item to convert

Returns

String representation of the value
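
One reason a "safe" repr is useful: the built-in `repr` of NaN or infinity yields `nan` or `inf`, which are not valid Python expressions. The sketch below shows one plausible way to handle that case. The function `safe_repr_sketch` and the exact strings it returns are assumptions, not the library's output.

```python
import math


def safe_repr_sketch(value):
    """Return a repr string that evaluates back to the value, even for nan/inf."""
    if isinstance(value, float):
        if math.isnan(value):
            return "float('nan')"
        if math.isinf(value):
            # repr(inf) is 'inf' and repr(-inf) is '-inf'.
            return f"float('{value!r}')"
    # Everything else falls back to the built-in repr.
    return repr(value)


inf_text = safe_repr_sketch(float("inf"))
neg_inf_text = safe_repr_sketch(float("-inf"))
nan_text = safe_repr_sketch(float("nan"))
plain_text = safe_repr_sketch("a")
```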

evalml.utils.gen_utils.save_plot(fig, filepath=None, format='png', interactive=False, return_filepath=False)[source]#

Saves fig to filepath if specified, or to a default location if not.

Parameters
  • fig (Figure) – Figure to be saved.

  • filepath (str or Path, optional) – Location to save file. If omitted, the file is saved under the default filename “test_plot”.

  • format (str) – Extension for figure to be saved as. Ignored if interactive is True and fig is of type plotly.Figure. Defaults to ‘png’.

  • interactive (bool, optional) – If True and fig is of type plotly.Figure, saves the fig as interactive instead of static, and format will be set to ‘html’. Defaults to False.

  • return_filepath (bool, optional) – Whether to return the final filepath the image is saved to. Defaults to False.

Returns

String representing the final filepath the image was saved to if return_filepath is True; otherwise None.

evalml.utils.gen_utils.SEED_BOUNDS#
evalml.utils.gen_utils.validate_holdout_datasets(X, X_train, pipeline_params)[source]#

Validate the holdout datasets match our expectations.

This function is run before calling predict in a time series pipeline. It verifies that X (the holdout set) is gap units away from the training set and is less than or equal to the forecast_horizon.

Parameters
  • X (pd.DataFrame) – Data of shape [n_samples, n_features].

  • X_train (pd.DataFrame) – Training data.

  • pipeline_params (dict) – Dictionary of time series parameters with gap, forecast_horizon, and time_index being required.

Returns

TSHoldoutValidationResult, a named tuple with three fields:
  • is_valid (bool) – True if holdout data is valid.

  • error_messages (list) – List of error messages to display. Empty if is_valid.

  • error_codes (list) – List of error codes to display. Empty if is_valid.