default_data_checks#
A default set of data checks that can be used for a variety of datasets.
Module Contents#
Classes Summary#
DefaultDataChecks | A collection of basic data checks that is used by AutoML by default.
Contents#
- class evalml.data_checks.default_data_checks.DefaultDataChecks(problem_type, objective, n_splits=3, problem_configuration=None)#
A collection of basic data checks that is used by AutoML by default.
Includes:
NullDataCheck
HighlyNullRowsDataCheck
IDColumnsDataCheck
TargetLeakageDataCheck
InvalidTargetDataCheck
NoVarianceDataCheck
ClassImbalanceDataCheck (for classification problem types)
TargetDistributionDataCheck (for regression problem types)
DateTimeFormatDataCheck (for time series problem types)
TimeSeriesParametersDataCheck (for time series problem types)
TimeSeriesSplittingDataCheck (for time series classification problem types)
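The conditional entries in the list above can be read as a simple selection rule. The following is a minimal sketch of that rule only, not EvalML's implementation: `default_check_names` is a hypothetical helper, and the problem type strings (for example `"time series binary"`) are assumed spellings.

```python
from typing import List

# Checks listed above without a problem type qualifier.
BASE_CHECKS = [
    "NullDataCheck",
    "HighlyNullRowsDataCheck",
    "IDColumnsDataCheck",
    "TargetLeakageDataCheck",
    "InvalidTargetDataCheck",
    "NoVarianceDataCheck",
]


def default_check_names(problem_type: str) -> List[str]:
    """Hypothetical helper: names of the checks that apply to a problem type."""
    pt = problem_type.lower()
    # Assumed naming scheme: "binary", "time series regression", etc.
    is_time_series = pt.startswith("time series")
    is_classification = pt.endswith("binary") or pt.endswith("multiclass")
    is_regression = pt.endswith("regression")

    names = list(BASE_CHECKS)
    if is_classification:
        names.append("ClassImbalanceDataCheck")
    if is_regression:
        names.append("TargetDistributionDataCheck")
    if is_time_series:
        names.append("DateTimeFormatDataCheck")
        names.append("TimeSeriesParametersDataCheck")
        if is_classification:
            names.append("TimeSeriesSplittingDataCheck")
    return names


print(default_check_names("time series binary"))
```

The helper returns names rather than check instances so the sketch stays independent of the library.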
- Parameters
problem_type (str) – The problem type that is being validated. Can be regression, binary, or multiclass.
objective (str or ObjectiveBase) – Name or instance of the objective class.
n_splits (int) – The number of splits as determined by the data splitter being used. Defaults to 3.
problem_configuration (dict) – Required for time series problem types. Values should be passed in for time_index, gap, forecast_horizon, and max_delay.
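For time series problem types, a problem_configuration dictionary might look like the sketch below. The values, including the `"date"` column name, are hypothetical, and `missing_configuration_keys` is an illustrative helper rather than part of EvalML; it only reports which of the four keys named above are absent.

```python
# The four keys the parameter description asks for.
REQUIRED_KEYS = ("time_index", "gap", "forecast_horizon", "max_delay")


def missing_configuration_keys(problem_configuration):
    """Hypothetical helper: list the required keys absent from the config."""
    config = problem_configuration or {}
    return [key for key in REQUIRED_KEYS if key not in config]


# Example values are assumptions chosen for illustration only.
example_configuration = {
    "time_index": "date",  # hypothetical name of the datetime column
    "gap": 0,
    "forecast_horizon": 1,
    "max_delay": 7,
}

print(missing_configuration_keys(example_configuration))
print(missing_configuration_keys({"gap": 0}))
```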
Methods
validate | Inspect and validate the input data against the data checks and return any applicable warnings and errors.
- validate(self, X, y=None)#
Inspect and validate the input data against the data checks and return any applicable warnings and errors.
- Parameters
X (pd.DataFrame, np.ndarray) – The input data of shape [n_samples, n_features]
y (pd.Series, np.ndarray) – The target data of length [n_samples]
- Returns
Dictionary containing DataCheckMessage objects
- Return type
dict
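The way a collection's validate call can gather messages from its member checks is sketched below. This is a toy illustration under stated assumptions, not EvalML code: the `Toy*` classes are hypothetical stand-ins, plain strings take the place of DataCheckMessage objects, a dict of lists stands in for the DataFrame, and the `"warnings"` and `"errors"` keys are an assumed layout for the returned dictionary.

```python
class ToyNoVarianceCheck:
    """Hypothetical stand-in: warns about columns with a single unique value."""

    def validate(self, X, y=None):
        warnings = [
            f"Column '{name}' has no variance"
            for name, values in X.items()
            if len(set(values)) <= 1
        ]
        return {"warnings": warnings, "errors": []}


class ToyTargetPresentCheck:
    """Hypothetical stand-in: raises an error message when no target is given."""

    def validate(self, X, y=None):
        errors = [] if y is not None else ["Target is missing"]
        return {"warnings": [], "errors": errors}


class ToyCheckCollection:
    """Runs every member check and merges the messages into one dictionary."""

    def __init__(self, checks):
        self.checks = checks

    def validate(self, X, y=None):
        results = {"warnings": [], "errors": []}
        for check in self.checks:
            messages = check.validate(X, y)
            results["warnings"].extend(messages["warnings"])
            results["errors"].extend(messages["errors"])
        return results


# Column name -> values; a plain dict stands in for a pd.DataFrame here.
X = {"a": [1, 1, 1], "b": [1, 2, 3]}
results = ToyCheckCollection([ToyNoVarianceCheck(), ToyTargetPresentCheck()]).validate(X)
print(results)
```

Keeping warnings and errors in separate lists lets a caller decide to stop on errors while only logging warnings.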