default_data_checks

A default set of data checks that can be used for a variety of datasets.

Module Contents

Classes Summary

DefaultDataChecks

A collection of basic data checks that is used by AutoML by default.

Contents

class evalml.data_checks.default_data_checks.DefaultDataChecks(problem_type, objective, n_splits=3, problem_configuration=None)

A collection of basic data checks that is used by AutoML by default.

Includes:

  • NullDataCheck

  • HighlyNullRowsDataCheck

  • IDColumnsDataCheck

  • TargetLeakageDataCheck

  • InvalidTargetDataCheck

  • NoVarianceDataCheck

  • ClassImbalanceDataCheck (for classification problem types)

  • TargetDistributionDataCheck (for regression problem types)

  • DateTimeFormatDataCheck (for time series problem types)

  • TimeSeriesParametersDataCheck (for time series problem types)

  • TimeSeriesSplittingDataCheck (for time series classification problem types)
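
As a rough illustration of how the list above depends on the problem type, the sketch below maps a problem type string to the check names listed. The helper `checks_for_problem_type` and the problem type strings (for example "time series binary") are assumptions made for this example, and the sketch assumes that "classification" also covers time series classification. It is not evalml's implementation.

```python
# Check names are copied from the list above. The helper and the
# problem-type strings are illustrative assumptions, not evalml's code.
BASE_CHECKS = [
    "NullDataCheck",
    "HighlyNullRowsDataCheck",
    "IDColumnsDataCheck",
    "TargetLeakageDataCheck",
    "InvalidTargetDataCheck",
    "NoVarianceDataCheck",
]


def checks_for_problem_type(problem_type):
    """Return the check names that the list above associates with a problem type."""
    name = problem_type.lower()
    is_time_series = name.startswith("time series")
    is_regression = name.endswith("regression")
    is_classification = name.endswith(("binary", "multiclass"))

    checks = list(BASE_CHECKS)
    if is_classification:
        checks.append("ClassImbalanceDataCheck")
    if is_regression:
        checks.append("TargetDistributionDataCheck")
    if is_time_series:
        checks.append("DateTimeFormatDataCheck")
        checks.append("TimeSeriesParametersDataCheck")
    if is_time_series and is_classification:
        checks.append("TimeSeriesSplittingDataCheck")
    return checks


print(checks_for_problem_type("time series binary"))
```

Under these assumptions, a plain regression problem gets the six base checks plus TargetDistributionDataCheck, while a time series classification problem gets all three time series checks in addition to ClassImbalanceDataCheck.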

Parameters
  • problem_type (str) – The problem type that is being validated. Can be regression, binary, or multiclass.

  • objective (str or ObjectiveBase) – Name or instance of the objective class.

  • n_splits (int) – The number of splits as determined by the data splitter being used. Defaults to 3.

  • problem_configuration (dict) – Required for time series problem types. Values should be passed in for time_index, gap, forecast_horizon, and max_delay.
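
Because problem_configuration must carry four keys for time series problems, it can help to verify the dictionary before constructing the checks. The sketch below is a minimal, standalone pre-flight check. The helper `missing_time_series_keys`, the column name "date", and the numeric values are hypothetical and are not part of evalml.

```python
# Keys named in the problem_configuration parameter description.
REQUIRED_TIME_SERIES_KEYS = ("time_index", "gap", "forecast_horizon", "max_delay")


def missing_time_series_keys(problem_configuration):
    """Return the required keys that are absent from problem_configuration."""
    config = problem_configuration or {}
    return [key for key in REQUIRED_TIME_SERIES_KEYS if key not in config]


# Hypothetical configuration; the column name and values are placeholders.
config = {"time_index": "date", "gap": 0, "forecast_horizon": 1, "max_delay": 2}
print(missing_time_series_keys(config))      # []
print(missing_time_series_keys({"gap": 0}))  # ['time_index', 'forecast_horizon', 'max_delay']
```

A dictionary that passes this check would then be supplied as the problem_configuration argument alongside problem_type and objective.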

Methods

validate

Inspects and validates the input data against the data checks and returns warnings and errors, if applicable.

validate(self, X, y=None)

Inspects and validates the input data against the data checks and returns warnings and errors, if applicable.

Parameters
  • X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]

  • y (pd.Series, np.ndarray) – The target data of length [n_samples]

Returns

Dictionary containing DataCheckMessage objects

Return type

dict
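
The shape requirements on X and y can be illustrated with a small standalone check. This is only a sketch: plain nested lists stand in for a pd.DataFrame or np.ndarray, and the helper `shapes_agree` is hypothetical rather than part of evalml.

```python
def shapes_agree(X, y=None):
    """Check that X is [n_samples, n_features] and y, if given, has n_samples entries."""
    n_samples = len(X)
    n_features = len(X[0]) if n_samples else 0
    # Every row must have the same number of features.
    rows_consistent = all(len(row) == n_features for row in X)
    # y is optional, matching the y=None default of validate.
    target_consistent = y is None or len(y) == n_samples
    return rows_consistent and target_consistent


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 samples, 2 features
print(shapes_agree(X, [0, 1, 0]))  # True
print(shapes_agree(X, [0, 1]))     # False
```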