default_data_checks

A default set of data checks that can be used for a variety of datasets.

Module Contents

Classes Summary

DefaultDataChecks

A collection of basic data checks that is used by AutoML by default.

Contents

class evalml.data_checks.default_data_checks.DefaultDataChecks(problem_type, objective, n_splits=3, problem_configuration=None)

A collection of basic data checks that is used by AutoML by default.

Includes:

  • NullDataCheck

  • HighlyNullRowsDataCheck

  • IDColumnsDataCheck

  • TargetLeakageDataCheck

  • InvalidTargetDataCheck

  • NoVarianceDataCheck

  • ClassImbalanceDataCheck (for classification problem types)

  • TargetDistributionDataCheck (for regression problem types)

  • DateTimeFormatDataCheck (for time series problem types)

  • TimeSeriesParametersDataCheck (for time series problem types)

  • TimeSeriesSplittingDataCheck (for time series classification problem types)
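The conditional membership in the list above can be sketched as a lookup on the problem type. This is an illustrative stand-in, not evalml's implementation: the helper `select_default_checks` is hypothetical, the checks are represented by their names as plain strings, and the problem-type strings (for example "time series binary") are assumed to follow evalml's naming.

```python
from typing import List

# Checks the list above includes for every problem type.
BASE_CHECKS = [
    "NullDataCheck",
    "HighlyNullRowsDataCheck",
    "IDColumnsDataCheck",
    "TargetLeakageDataCheck",
    "InvalidTargetDataCheck",
    "NoVarianceDataCheck",
]


def select_default_checks(problem_type: str) -> List[str]:
    """Hypothetical helper: names of the documented checks for a problem type."""
    kind = problem_type.lower().strip()
    # Assumed naming: time series types are prefixed with "time series".
    is_time_series = kind.startswith("time series")
    is_classification = kind.endswith(("binary", "multiclass"))
    is_regression = kind.endswith("regression")
    if not (is_classification or is_regression):
        raise ValueError(f"Unrecognized problem type: {problem_type!r}")

    checks = list(BASE_CHECKS)
    if is_classification:
        checks.append("ClassImbalanceDataCheck")
    if is_regression:
        checks.append("TargetDistributionDataCheck")
    if is_time_series:
        checks.append("DateTimeFormatDataCheck")
        checks.append("TimeSeriesParametersDataCheck")
        if is_classification:
            checks.append("TimeSeriesSplittingDataCheck")
    return checks


for name in select_default_checks("time series binary"):
    print(name)
```

Keeping the shared checks in one list and appending the conditional ones makes it easy to compare the result for two problem types side by side.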

Parameters
  • problem_type (str) – The problem type that is being validated. Can be regression, binary, or multiclass.

  • objective (str or ObjectiveBase) – Name or instance of the objective class.

  • n_splits (int) – The number of splits as determined by the data splitter being used. Defaults to 3.

  • datetime_column (str) – The name of the column containing datetime information to be used for time series problems. Defaults to "index", indicating that the datetime information is in the index of X or y.

Methods

validate

Inspects and validates the input data against the data checks and returns any warnings and errors found.

validate(self, X, y=None)

Inspects and validates the input data against the data checks and returns any warnings and errors found.

Parameters
  • X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]

  • y (pd.Series, np.ndarray) – The target data of length [n_samples]

Returns

Dictionary containing DataCheckMessage objects

Return type

dict
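How a collection's validate step might merge the results of its individual checks can be sketched as follows. Everything here is a stand-in under stated assumptions: `Message`, `null_check`, `no_variance_check`, and `validate_all` are hypothetical names, columns are plain Python lists rather than pandas objects, and the "warnings" and "errors" dictionary keys are assumed rather than taken from this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Message:
    """Hypothetical stand-in for a DataCheckMessage."""
    check: str
    level: str  # "warning" or "error"
    text: str


Columns = Dict[str, list]
Check = Callable[[Columns, Optional[list]], List[Message]]


def null_check(X: Columns, y: Optional[list] = None) -> List[Message]:
    """Warn about non-empty columns made up entirely of None values."""
    return [
        Message("null_check", "warning", f"Column '{name}' is entirely null")
        for name, values in X.items()
        if values and all(v is None for v in values)
    ]


def no_variance_check(X: Columns, y: Optional[list] = None) -> List[Message]:
    """Report an error when the target takes a single unique value."""
    if y is not None and len(set(y)) == 1:
        return [Message("no_variance_check", "error", "Target has no variance")]
    return []


def validate_all(checks: List[Check], X: Columns, y: Optional[list] = None) -> dict:
    """Run every check and group the messages by severity (assumed layout)."""
    results: Dict[str, List[Message]] = {"warnings": [], "errors": []}
    for check in checks:
        for message in check(X, y):
            key = "errors" if message.level == "error" else "warnings"
            results[key].append(message)
    return results


X = {"a": [1, 2, 3], "b": [None, None, None]}
y = [0, 0, 0]
results = validate_all([null_check, no_variance_check], X, y)
for level, messages in results.items():
    for message in messages:
        print(level, message.check, message.text)
```

Grouping by severity lets a caller treat errors as blocking while only logging warnings; the actual keys and message fields should be confirmed against the installed evalml version.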