evalml.data_checks.ClassImbalanceDataCheck.validate

ClassImbalanceDataCheck.validate(X, y)
Checks if any target labels are imbalanced beyond a threshold for binary and multiclass problems.

Ignores NaN values in target labels if they appear.

Parameters
  • X (ww.DataTable, pd.DataFrame, np.ndarray) – Features. Ignored.

  • y (ww.DataColumn, pd.Series, np.ndarray) – Target labels to check for imbalanced data.

Returns

Dictionary with DataCheckWarnings if any class makes up a smaller fraction of the target than the threshold, and DataCheckErrors if the number of instances of any target value is below 2 * num_cv_folds.

Return type

dict
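The returned dictionary can be inspected like any nested Python structure. The following is a minimal sketch of reading it, assuming the "errors"/"warnings" layout and keys shown in the Example on this page. The helper name summarize is hypothetical and not part of evalml.

```python
# Hypothetical helper (not an evalml API): flatten a validate() result
# into (level, code, target_values) tuples.
def summarize(result):
    issues = []
    for key in ("errors", "warnings"):
        for entry in result.get(key, []):
            issues.append(
                (entry["level"], entry["code"], entry["details"]["target_values"])
            )
    return issues


# A literal shaped like the result in the Example on this page
# (messages shortened here for brevity).
result = {
    "errors": [{
        "message": "...",
        "data_check_name": "ClassImbalanceDataCheck",
        "level": "error",
        "code": "CLASS_IMBALANCE_BELOW_FOLDS",
        "details": {"target_values": [0]},
    }],
    "warnings": [{
        "message": "...",
        "data_check_name": "ClassImbalanceDataCheck",
        "level": "warning",
        "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
        "details": {"target_values": [0]},
    }],
}

for level, code, targets in summarize(result):
    print(f"{level}: {code} -> {targets}")
```

Errors are listed before warnings here only because the helper visits the "errors" key first.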

Example

>>> import pandas as pd
>>> from evalml.data_checks import ClassImbalanceDataCheck
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == {
...     "errors": [{
...         "message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",
...         "data_check_name": "ClassImbalanceDataCheck",
...         "level": "error",
...         "code": "CLASS_IMBALANCE_BELOW_FOLDS",
...         "details": {"target_values": [0]}}],
...     "warnings": [{
...         "message": "The following labels fall below 10% of the target: [0]",
...         "data_check_name": "ClassImbalanceDataCheck",
...         "level": "warning",
...         "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
...         "details": {"target_values": [0]}}]}
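The arithmetic behind the example can be reproduced in plain Python. This is an illustrative sketch, not evalml's implementation: the function name imbalance_sketch is hypothetical, num_cv_folds=3 is inferred from the "= 6 instances" in the error message above, and the strict "<" comparisons are an assumption.

```python
import math
from collections import Counter


def imbalance_sketch(y, threshold=0.10, num_cv_folds=3):
    """Return (labels below 2 * num_cv_folds instances, labels below threshold)."""
    # Drop missing labels (None or float NaN), mirroring "Ignores NaN values".
    labels = [
        v for v in y
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(labels)
    total = len(labels)
    # Classes with too few instances to populate the cross-validation folds.
    below_folds = sorted(k for k, c in counts.items() if c < 2 * num_cv_folds)
    # Classes whose share of the non-missing target is under the threshold.
    below_threshold = sorted(k for k, c in counts.items() if c / total < threshold)
    return below_folds, below_threshold


# Same target as the example: one 0 and ten 1s.
y = [0] + [1] * 10
below_folds, below_threshold = imbalance_sketch(y)
print(below_folds, below_threshold)
```

With this target, label 0 appears once out of 11 values, about 9.1%, which is under the 10% threshold, and its single instance is fewer than 2 * 3 = 6, so it is flagged by both conditions.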