gen_utils
General utility methods.
Module Contents
Classes Summary
classproperty | Allows function to be accessed as a class level property.
Functions
are_datasets_separated_by_gap_time_index | Determine if the train and test datasets are separated by gap number of units using the time_index.
are_ts_parameters_valid_for_split | Validates the time series parameters in problem_configuration are compatible with split sizes.
contains_all_ts_parameters | Validates that the problem configuration contains all required keys.
convert_to_seconds | Converts a string describing a length of time to its length in seconds.
deprecate_arg | Helper to raise warnings when a deprecated arg is used.
drop_rows_with_nans | Drop rows that have any NaNs in all dataframes or series.
get_importable_subclasses | Get importable subclasses of a base class. Used to list all of our estimators, transformers, components and pipelines dynamically.
get_random_seed | Given a numpy.random.RandomState object, generate an int representing a seed value for another random number generator. Or, if given an int, return that int.
get_random_state | Generates a numpy.random.RandomState instance using seed.
get_time_index | Determines the column in the given data that should be used as the time index.
import_or_raise | Attempts to import the requested library by name. If the import fails, raises an ImportError or warning.
is_all_numeric | Checks if the given DataFrame contains only numeric values.
is_categorical_actually_boolean | Function to identify columns of a dataframe that contain True, False and null type.
jupyter_check | Get whether or not the code is being run in an IPython environment (such as Jupyter Notebook or Jupyter Lab).
pad_with_nans | Pad the beginning num_to_pad rows with nans.
safe_repr | Convert the given value into a string that can safely be used for repr.
save_plot | Saves fig to filepath if specified, or to a default location if not.
validate_holdout_datasets | Validate the holdout datasets match our expectations.
Attributes Summary
Contents
- evalml.utils.gen_utils.are_datasets_separated_by_gap_time_index(train, test, pipeline_params)
Determine if the train and test datasets are separated by gap number of units using the time_index.
This will be true when users are predicting on unseen data but not during cross validation since the target is known.
- Parameters
train (pd.DataFrame) – Training data.
test (pd.DataFrame) – Data of shape [n_samples, n_features].
pipeline_params (dict) – Dictionary of time series parameters.
- Returns
True if the difference in time units is equal to gap + 1.
- Return type
bool
- evalml.utils.gen_utils.are_ts_parameters_valid_for_split(gap, max_delay, forecast_horizon, n_obs, n_splits)
Validates the time series parameters in problem_configuration are compatible with split sizes.
- Parameters
gap (int) – gap value.
max_delay (int) – max_delay value.
forecast_horizon (int) – forecast_horizon value.
n_obs (int) – Number of observations in the dataset.
n_splits (int) – Number of cross validation splits.
- Returns
- TsParameterValidationResult - named tuple with four fields:
is_valid (bool) – True if parameters are valid.
msg (str) – Contains error message to display. Empty if is_valid.
smallest_split_size (int) – Smallest split size given n_obs and n_splits.
max_window_size (int) – Max window size given gap, max_delay, forecast_horizon.
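The page does not state how the two sizes are computed or compared, so the following is only a sketch under stated assumptions: that the smallest split is `n_obs // (n_splits + 1)`, that the window is `gap + max_delay + forecast_horizon`, and that the parameters are valid when the split is strictly larger than the window. `ValidationResultSketch` and `check_ts_split_sketch` are hypothetical names, not evalml's.

```python
from collections import namedtuple

# Hypothetical stand-in for the four-field named tuple described above.
ValidationResultSketch = namedtuple(
    "ValidationResultSketch",
    ("is_valid", "msg", "smallest_split_size", "max_window_size"),
)


def check_ts_split_sketch(gap, max_delay, forecast_horizon, n_obs, n_splits):
    # Assumption: n_splits folds cut the data into n_splits + 1 blocks.
    smallest_split_size = n_obs // (n_splits + 1)
    # Assumption: the window is the sum of the three time series parameters.
    max_window_size = gap + max_delay + forecast_horizon
    # Assumption: a split must be strictly larger than the window.
    is_valid = smallest_split_size > max_window_size
    msg = ""
    if not is_valid:
        msg = (
            f"Smallest split size ({smallest_split_size}) must be larger than "
            f"gap + max_delay + forecast_horizon ({max_window_size})."
        )
    return ValidationResultSketch(is_valid, msg, smallest_split_size, max_window_size)
```

Returning a named tuple lets callers either unpack all four fields or read them by name, for example `result.is_valid`.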
- class evalml.utils.gen_utils.classproperty(func)
Allows function to be accessed as a class level property.
Example:
    class LogisticRegressionBinaryPipeline(PipelineBase):
        component_graph = ['Simple Imputer', 'Logistic Regression Classifier']

        @classproperty
        def summary(cls):
            summary = ""
            for component in cls.component_graph:
                component = handle_component_class(component)
                summary += component.name + " + "
            return summary

    assert LogisticRegressionBinaryPipeline.summary == "Simple Imputer + Logistic Regression Classifier + "
    assert LogisticRegressionBinaryPipeline().summary == "Simple Imputer + Logistic Regression Classifier + "
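One common way to get this behavior is a small descriptor. The sketch below is an assumption about how such a decorator could be written, not the evalml source; `classproperty_sketch` and `Pipeline` are hypothetical names used only for illustration.

```python
class classproperty_sketch:
    """Descriptor that exposes a function as a read-only, class-level property."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # `owner` is the class for both class-level and instance-level access,
        # so the wrapped function always receives the class.
        return self.func(owner)


class Pipeline:
    component_graph = ["Simple Imputer", "Logistic Regression Classifier"]

    @classproperty_sketch
    def summary(cls):
        return " + ".join(cls.component_graph)
```

Because the descriptor defines only `__get__`, the value can be read from the class or from an instance without calling it as a method.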
- evalml.utils.gen_utils.contains_all_ts_parameters(problem_configuration)
Validates that the problem configuration contains all required keys.
- Parameters
problem_configuration (dict) – Problem configuration.
- Returns
- A tuple (is_valid, msg). is_valid is True if the configuration contains all parameters. If is_valid is False, msg is a non-empty string with the error message.
- Return type
bool, str
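A key-presence check that returns a bool and a message could look like the sketch below. The list of required keys is an assumption made for illustration (this page does not list them), and `contains_all_keys_sketch` is a hypothetical name.

```python
# Assumed set of required keys; not confirmed by this page.
REQUIRED_KEYS_SKETCH = ("time_index", "gap", "max_delay", "forecast_horizon")


def contains_all_keys_sketch(problem_configuration):
    # Treat a missing configuration the same as an empty one.
    config = problem_configuration or {}
    missing = [key for key in REQUIRED_KEYS_SKETCH if key not in config]
    if missing:
        return False, f"problem_configuration is missing required keys: {missing}"
    return True, ""
```

Callers can unpack the result directly, for example `is_valid, msg = contains_all_keys_sketch(config)`.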
- evalml.utils.gen_utils.convert_to_seconds(input_str)
Converts a string describing a length of time to its length in seconds.
- Parameters
input_str (str) – The string to be parsed and converted to seconds.
- Returns
The length of time described by input_str, in seconds.
- Raises
AssertionError – If an invalid unit is used.
Examples
>>> assert convert_to_seconds("10 hr") == 36000.0
>>> assert convert_to_seconds("30 minutes") == 1800.0
>>> assert convert_to_seconds("2.5 min") == 150.0
- evalml.utils.gen_utils.deprecate_arg(old_arg, new_arg, old_value, new_value)
Helper to raise warnings when a deprecated arg is used.
- Parameters
old_arg (str) – Name of old/deprecated argument.
new_arg (str) – Name of new argument.
old_value (Any) – Value the user passed in for the old argument.
new_value (Any) – Value the user passed in for the new argument.
- Returns
old_value if not None, else new_value
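The return rule above (old_value if not None, else new_value) can be sketched with the standard `warnings` module. The warning category and message wording here are assumptions, and `deprecate_arg_sketch` is a hypothetical name rather than the evalml implementation.

```python
import warnings


def deprecate_arg_sketch(old_arg, new_arg, old_value, new_value):
    if old_value is not None:
        # Assumed message and category; the library's wording may differ.
        warnings.warn(
            f"Argument '{old_arg}' has been deprecated in favor of '{new_arg}'.",
            DeprecationWarning,
            stacklevel=2,
        )
        return old_value
    return new_value
```

A typical call site would assign the result back to the new name, for example `random_seed = deprecate_arg_sketch("random_state", "random_seed", random_state, random_seed)`.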
- evalml.utils.gen_utils.drop_rows_with_nans(*pd_data)
Drop rows that have a NaN in any of the given dataframes or series. The same rows are dropped from every input.
- Parameters
*pd_data – sequence of pd.Series or pd.DataFrame or None
- Returns
list of pd.DataFrame or pd.Series or None
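A minimal pandas sketch of this kind of joint filtering follows. It assumes all non-None inputs share the same index, and `drop_rows_with_nans_sketch` is a hypothetical name; the library's own handling of mismatched indices is not described here.

```python
import pandas as pd


def drop_rows_with_nans_sketch(*pd_data):
    present = [data for data in pd_data if data is not None]
    if not present:
        return list(pd_data)
    # Start by keeping every row, then drop any row that has a NaN in any input.
    keep = pd.Series(True, index=present[0].index)
    for data in present:
        if isinstance(data, pd.DataFrame):
            has_nan = data.isna().any(axis=1)
        else:
            has_nan = data.isna()
        keep &= ~has_nan
    # None inputs are passed through unchanged.
    return [None if data is None else data[keep] for data in pd_data]
```

Filtering with one shared mask keeps features and target aligned after the rows are removed.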
- evalml.utils.gen_utils.get_importable_subclasses(base_class, used_in_automl=True)
Get importable subclasses of a base class. Used to list all of our estimators, transformers, components and pipelines dynamically.
- Parameters
base_class (abc.ABCMeta) – Base class to find all of the subclasses for.
used_in_automl (bool) – Not all components/pipelines/estimators are used in automl search. If True, only include those subclasses that are used in the search. This would mean excluding classes related to ExtraTrees, ElasticNet, and Baseline estimators. Defaults to True.
- Returns
List of subclasses.
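Dynamic discovery of subclasses is usually built on `__subclasses__()`. The sketch below shows only that recursive walk; it does not attempt the importability check or the `used_in_automl` filter, and `all_subclasses_sketch` and the toy classes are hypothetical.

```python
def all_subclasses_sketch(base_class):
    """Return every direct and indirect subclass of base_class."""
    found = []
    for subclass in base_class.__subclasses__():
        found.append(subclass)
        # Recurse so that grandchildren and deeper descendants are included.
        found.extend(all_subclasses_sketch(subclass))
    return found


# Toy hierarchy used only to exercise the sketch.
class Base:
    pass


class Estimator(Base):
    pass


class LinearEstimator(Estimator):
    pass


class Transformer(Base):
    pass
```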
- evalml.utils.gen_utils.get_random_seed(random_state, min_bound=SEED_BOUNDS.min_bound, max_bound=SEED_BOUNDS.max_bound)
Given a numpy.random.RandomState object, generate an int representing a seed value for another random number generator. Or, if given an int, return that int.
To protect against invalid input to a particular library’s random number generator, if an int value is provided, and it is outside the bounds “[min_bound, max_bound)”, the value will be projected into the range between the min_bound (inclusive) and max_bound (exclusive) using modular arithmetic.
- Parameters
random_state (int, numpy.random.RandomState) – random state
min_bound (None, int) – Lower bound (inclusive) used when generating the seed. Defaults to SEED_BOUNDS.min_bound. Must be less than max_bound.
max_bound (None, int) – Upper bound (exclusive) used when generating the seed. Defaults to SEED_BOUNDS.max_bound. Must be greater than min_bound.
- Returns
Seed for random number generator
- Return type
int
- Raises
ValueError – If boundaries are not valid.
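The projection of an out-of-range int into [min_bound, max_bound) can be illustrated with modular arithmetic. The exact formula evalml uses is not given on this page, so the one below is an assumption, and `project_seed_sketch` is a hypothetical name.

```python
def project_seed_sketch(value, min_bound, max_bound):
    if min_bound >= max_bound:
        raise ValueError("min_bound must be less than max_bound")
    if min_bound <= value < max_bound:
        # Already inside the half-open interval: return unchanged.
        return value
    # Wrap the offset from min_bound into the width of the interval.
    return min_bound + (value - min_bound) % (max_bound - min_bound)
```

Python's `%` always returns a non-negative result for a positive modulus, so values below `min_bound` wrap into the range as well.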
- evalml.utils.gen_utils.get_random_state(seed)
Generates a numpy.random.RandomState instance using seed.
- Parameters
seed (None, int, np.random.RandomState object) – seed to use to generate numpy.random.RandomState. Must be between SEED_BOUNDS.min_bound and SEED_BOUNDS.max_bound, inclusive.
- Raises
ValueError – If the input seed is not within the acceptable range.
- Returns
A numpy.random.RandomState instance.
- evalml.utils.gen_utils.get_time_index(X: pandas.DataFrame, y: pandas.Series, time_index_name: str)
Determines the column in the given data that should be used as the time index.
- evalml.utils.gen_utils.import_or_raise(library, error_msg=None, warning=False)
Attempts to import the requested library by name. If the import fails, raises an ImportError or warning.
- Parameters
library (str) – The name of the library.
error_msg (str) – Error message to return if the import fails.
warning (bool) – If True, import_or_raise gives a warning instead of ImportError. Defaults to False.
- Returns
Returns the library if importing succeeded.
- Raises
ImportError – If attempting to import the library fails because the library is not installed.
Exception – If importing the library fails.
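The import-or-warn pattern described here can be sketched with `importlib`. This is a simplified illustration, not the evalml implementation: it handles only ImportError, the default message is an assumption, and `import_or_raise_sketch` is a hypothetical name.

```python
import importlib
import warnings


def import_or_raise_sketch(library, error_msg=None, warning=False):
    try:
        return importlib.import_module(library)
    except ImportError as err:
        # Assumed default message; the library's wording may differ.
        msg = error_msg or f"Missing optional dependency '{library}'."
        if warning:
            # Downgrade the failure to a warning and signal it with None.
            warnings.warn(msg)
            return None
        raise ImportError(msg) from err
```

With `warning=True`, callers need to handle a `None` result before using the module.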
- evalml.utils.gen_utils.is_all_numeric(df)
Checks if the given DataFrame contains only numeric values.
- Parameters
df (pd.DataFrame) – The DataFrame to check data types of.
- Returns
True if all the columns are numeric and are not missing any values, False otherwise.
- evalml.utils.gen_utils.is_categorical_actually_boolean(df, df_col)
Function to identify columns of a dataframe that contain True, False and null type.
The function is intended to be applied to columns that are identified as Categorical by the Imputer/SimpleImputer.
- Parameters
df (pandas.DataFrame) – Pandas dataframe with data.
df_col (str) – The column to identify as basically a nullable Boolean.
- Returns
Whether the column contains True, False and a null type.
- Return type
bool
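One way to express "contains True, False and a null type" in pandas is sketched below. The precise criteria evalml applies are not spelled out on this page, so the rule here (at least one null, and non-null values exactly True and False) is an assumption, and `looks_like_nullable_boolean_sketch` is a hypothetical name.

```python
import pandas as pd


def looks_like_nullable_boolean_sketch(df, df_col):
    column = df[df_col]
    non_null_values = set(column.dropna().unique())
    has_null = column.isna().any()
    # Note: 1 and 0 compare equal to True and False in Python, so a column of
    # 1/0/null would also pass this simple check.
    return bool(has_null and non_null_values == {True, False})
```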
- evalml.utils.gen_utils.jupyter_check()
Get whether or not the code is being run in an IPython environment (such as Jupyter Notebook or Jupyter Lab).
- Returns
True if IPython, False otherwise.
- Return type
bool
- evalml.utils.gen_utils.logger
- evalml.utils.gen_utils.pad_with_nans(pd_data, num_to_pad)
Pad the beginning num_to_pad rows with nans.
- Parameters
pd_data (pd.DataFrame or pd.Series) – Data to pad.
num_to_pad (int) – Number of nans to pad.
- Returns
pd.DataFrame or pd.Series
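Padding the start of a Series or DataFrame with NaN rows can be sketched with `reindex`. This is an illustration under assumptions, not the evalml source: the result here gets a fresh 0-based integer index, and integer columns become float so they can hold NaN. `pad_with_nans_sketch` is a hypothetical name.

```python
import pandas as pd


def pad_with_nans_sketch(pd_data, num_to_pad):
    # Shift the existing rows down by num_to_pad positions.
    shifted = pd_data.reset_index(drop=True)
    shifted.index = shifted.index + num_to_pad
    # Reindexing over the full range fills the new leading rows with NaN.
    return shifted.reindex(index=pd.RangeIndex(len(pd_data) + num_to_pad))
```

For example, padding `pd.Series([1, 2, 3])` with `num_to_pad=2` gives five rows, the first two of which are NaN.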
- evalml.utils.gen_utils.safe_repr(value)
Convert the given value into a string that can safely be used for repr.
- Parameters
value – The item to convert
- Returns
String representation of the value
- evalml.utils.gen_utils.save_plot(fig, filepath=None, format='png', interactive=False, return_filepath=False)
Saves fig to filepath if specified, or to a default location if not.
- Parameters
fig (Figure) – Figure to be saved.
filepath (str or Path, optional) – Location to save the file. If not specified, the figure is saved to a default location with the filename “test_plot”.
format (str) – Extension for figure to be saved as. Ignored if interactive is True and fig is of type plotly.Figure. Defaults to ‘png’.
interactive (bool, optional) – If True and fig is of type plotly.Figure, saves the fig as interactive instead of static, and format will be set to ‘html’. Defaults to False.
return_filepath (bool, optional) – Whether to return the final filepath the image is saved to. Defaults to False.
- Returns
The final filepath the image was saved to, as a string, if return_filepath is True; otherwise None.
- evalml.utils.gen_utils.SEED_BOUNDS
- evalml.utils.gen_utils.validate_holdout_datasets(X, X_train, pipeline_params)
Validate the holdout datasets match our expectations.
This function is run before calling predict in a time series pipeline. It verifies that X (the holdout set) is gap units away from the training set and that its length is less than or equal to forecast_horizon.
- Parameters
X (pd.DataFrame) – Data of shape [n_samples, n_features].
X_train (pd.DataFrame) – Training data.
pipeline_params (dict) – Dictionary of time series parameters with gap, forecast_horizon, and time_index being required.
- Returns
- TSHoldoutValidationResult - named tuple with three fields:
is_valid (bool) – True if holdout data is valid.
error_messages (list) – List of error messages to display. Empty if is_valid.
error_codes (list) – List of error codes to display. Empty if is_valid.