evalml.data_checks.ClassImbalanceDataCheck.validate

ClassImbalanceDataCheck.validate(X, y)[source]

Checks whether any target labels are imbalanced beyond a threshold, for binary and multiclass problems. NaN values in the target labels are ignored.

Parameters
  • X (pd.DataFrame, pd.Series, np.array, list) – Features. Ignored.

  • y – Target labels to check for imbalanced data.

Returns

list with DataCheckWarnings if any class's share of the target falls below the threshold, and DataCheckErrors if the number of instances of any target class is below 2 * num_cv_folds.

Return type

list (DataCheckWarning, DataCheckError)

Example

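>>> import pandas as pd
>>> from evalml.data_checks import ClassImbalanceDataCheck, DataCheckError, DataCheckWarning
>>> X = pd.DataFrame({})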
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == [DataCheckError("The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]", "ClassImbalanceDataCheck"), DataCheckWarning("The following labels fall below 10% of the target: [0]", "ClassImbalanceDataCheck")]
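
The two conditions described under Returns can be illustrated with a minimal plain-Python sketch. This is not evalml's implementation: the helper name `sketch_class_imbalance` is hypothetical, and the default of 3 folds is only inferred from the "= 6 instances" message in the example above. It drops missing labels, then reports which labels fall below the threshold share and which have fewer than 2 * num_cv_folds instances.

```python
import math
from collections import Counter


def sketch_class_imbalance(y, threshold=0.10, num_cv_folds=3):
    """Hypothetical helper: return (labels_below_threshold, labels_below_fold_minimum).

    Not evalml's code; num_cv_folds=3 is an assumption inferred from the example.
    """
    # Drop missing labels (None or float NaN) before counting.
    labels = [
        v for v in y
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(labels)
    total = len(labels)

    # Warning condition: the label's share of the target is below the threshold.
    below_threshold = sorted(k for k, c in counts.items() if c / total < threshold)

    # Error condition: the label has fewer than 2 * num_cv_folds instances.
    min_instances = 2 * num_cv_folds
    below_fold_minimum = sorted(k for k, c in counts.items() if c < min_instances)

    return below_threshold, below_fold_minimum


y = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(sketch_class_imbalance(y))  # ([0], [0])
```

With the same target as the example, label 0 makes up 1 of 11 values (about 9.1%), which is below the 10% threshold, and its single instance is fewer than 6, so it appears in both lists.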