utils
=============================

.. py:module:: evalml.automl.utils

.. autoapi-nested-parse::

   Utilities useful in AutoML.


Module Contents
---------------

Functions
~~~~~~~~~

.. autoapisummary::
   :nosignatures:

   evalml.automl.utils.check_all_pipeline_names_unique
   evalml.automl.utils.get_best_sampler_for_data
   evalml.automl.utils.get_default_primary_search_objective
   evalml.automl.utils.get_pipelines_from_component_graphs
   evalml.automl.utils.get_threshold_tuning_info
   evalml.automl.utils.make_data_splitter
   evalml.automl.utils.resplit_training_data
   evalml.automl.utils.tune_binary_threshold

Attributes Summary
~~~~~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.automl.utils.AutoMLConfig


Contents
~~~~~~~~~~~~~~~~~~~

.. py:data:: AutoMLConfig


.. py:function:: check_all_pipeline_names_unique(pipelines)

   Checks whether all the pipeline names are unique.

   :param pipelines: List of pipelines to check if all names are unique.
   :type pipelines: list[PipelineBase]

   :raises ValueError: If any pipeline names are duplicated.


.. py:function:: get_best_sampler_for_data(X, y, sampler_method, sampler_balanced_ratio)

   Returns the name of the sampler component to use for AutoMLSearch.

   :param X: The input feature data.
   :type X: pd.DataFrame
   :param y: The input target data.
   :type y: pd.Series
   :param sampler_method: The sampler_type argument passed to AutoMLSearch.
   :type sampler_method: str
   :param sampler_balanced_ratio: The ratio of minority:majority targets that we would consider balanced, or should balance the classes to.
   :type sampler_balanced_ratio: float

   :returns: The string name of the sampling component to use, or None if no sampler is necessary.
   :rtype: str, None


.. py:function:: get_default_primary_search_objective(problem_type)

   Get the default primary search objective for a problem type.

   :param problem_type: Problem type of interest.
   :type problem_type: str or ProblemType

   :returns: Primary objective instance for the problem type.
   :rtype: ObjectiveBase


.. py:function:: get_pipelines_from_component_graphs(component_graphs_dict, problem_type, parameters=None, random_seed=0)

   Returns created pipelines from passed component graphs based on the specified problem type.

   :param component_graphs_dict: The dict of component graphs.
   :type component_graphs_dict: dict
   :param problem_type: The problem type for which pipelines will be created.
   :type problem_type: str or ProblemType
   :param parameters: Pipeline-level parameters that should be passed to the proposed pipelines. Defaults to None.
   :type parameters: dict
   :param random_seed: Random seed. Defaults to 0.
   :type random_seed: int

   :returns: List of pipelines made from the passed component graphs.
   :rtype: list


.. py:function:: get_threshold_tuning_info(automl_config, pipeline)

   Determine for a given automl config and pipeline what the threshold tuning objective should be and whether or not training data should be further split to achieve proper threshold tuning.

   Can also be used after automl search has been performed to determine whether the full training data was used to train the pipeline.

   :param automl_config: The AutoMLSearch's config object. Used to determine threshold tuning objective and whether data needs resplitting.
   :type automl_config: AutoMLConfig
   :param pipeline: The pipeline instance to threshold.
   :type pipeline: Pipeline

   :returns: threshold_tuning_objective, data_needs_resplitting (str, bool)


.. py:function:: make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=3, shuffle=True, random_seed=0)

   Given the training data and ML problem parameters, compute a data splitting method to use during AutoML search.

   :param X: The input training data of shape [n_samples, n_features].
   :type X: pd.DataFrame
   :param y: The target training data of length [n_samples].
   :type y: pd.Series
   :param problem_type: The type of machine learning problem.
   :type problem_type: ProblemType
   :param problem_configuration: Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables. Defaults to None.
   :type problem_configuration: dict, None
   :param n_splits: The number of CV splits, if applicable. Defaults to 3.
   :type n_splits: int, None
   :param shuffle: Whether or not to shuffle the data before splitting, if applicable. Defaults to True.
   :type shuffle: bool
   :param random_seed: Seed for the random number generator. Defaults to 0.
   :type random_seed: int

   :returns: Data splitting method.
   :rtype: sklearn.model_selection.BaseCrossValidator

   :raises ValueError: If problem_configuration is not given for a time-series problem.


.. py:function:: resplit_training_data(pipeline, X_train, y_train)

   Further split the training data for a given pipeline. This is needed for binary pipelines in order to properly tune the threshold.

   Can be used after automl search has been performed to recreate the data that was used to train a pipeline.

   :param pipeline: The pipeline whose training data we are splitting.
   :type pipeline: PipelineBase
   :param X_train: Training data of shape [n_samples, n_features].
   :type X_train: pd.DataFrame or np.ndarray
   :param y_train: Training target data of length [n_samples].
   :type y_train: pd.Series or np.ndarray

   :returns: Feature and target data each split into train and threshold tuning sets.
   :rtype: pd.DataFrame, pd.DataFrame, pd.Series, pd.Series


.. py:function:: tune_binary_threshold(pipeline, objective, problem_type, X_threshold_tuning, y_threshold_tuning, X=None, y=None)

   Tunes the threshold of a binary pipeline to the X and y thresholding data.

   :param pipeline: Pipeline instance to threshold.
   :type pipeline: Pipeline
   :param objective: The objective we want to tune with. If not tuneable and best_pipeline is True, will use F1.
   :type objective: ObjectiveBase
   :param problem_type: The problem type of the pipeline.
   :type problem_type: ProblemType
   :param X_threshold_tuning: Features to which the pipeline will be tuned.
   :type X_threshold_tuning: pd.DataFrame
   :param y_threshold_tuning: Target data to which the pipeline will be tuned.
   :type y_threshold_tuning: pd.Series
   :param X: Features on which the pipeline will be trained (used for time series binary). Defaults to None.
   :type X: pd.DataFrame
   :param y: Target on which the pipeline will be trained (used for time series binary). Defaults to None.
   :type y: pd.Series
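
To make the idea behind threshold tuning concrete, the following is a minimal, library-free sketch. It does not call evalml and is not evalml's implementation: the helper names ``f1_score`` and ``tune_threshold``, the grid of candidate thresholds, and the toy data are all assumptions made for illustration. It only shows the general technique of choosing the decision threshold that maximizes an objective (F1 here) on a held-out threshold tuning set.

```python
# Illustrative sketch only: hypothetical helpers, not part of evalml.


def f1_score(y_true, y_pred):
    """Compute F1 for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(y_true, y_proba, candidates=None):
    """Return the (threshold, score) pair with the highest F1.

    A simple grid search over candidate thresholds; ties keep the
    lowest threshold because only strict improvements replace the best.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_threshold, best_score = 0.5, -1.0
    for threshold in candidates:
        # Predict the positive class when the probability meets the threshold.
        y_pred = [1 if p >= threshold else 0 for p in y_proba]
        score = f1_score(y_true, y_pred)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy threshold tuning set: true labels and predicted positive-class probabilities.
y_threshold_tuning = [0, 0, 0, 1, 0, 1, 1, 1]
y_proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]

threshold, score = tune_threshold(y_threshold_tuning, y_proba)
print(f"best threshold: {threshold:.2f}, F1: {score:.3f}")
```

On this toy data the default threshold of 0.5 would miss two of the four positives, while the tuned threshold lands below 0.5 and recovers them at the cost of one false positive. Tuning on data held out from training, as ``resplit_training_data`` provides, avoids fitting the threshold to the same rows the pipeline was trained on.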