evalml.data_checks.ClassImbalanceDataCheck.validate

ClassImbalanceDataCheck.validate(X, y)
Checks whether any target labels are imbalanced beyond a threshold for binary and multiclass problems.

Ignores NaN values in target labels if they appear.

Parameters
  • X (ww.DataTable, pd.DataFrame, np.ndarray) – Features. Ignored.

  • y (ww.DataColumn, pd.Series, np.ndarray) – Target labels to check for imbalanced data.

Returns

Dictionary with DataCheckWarnings if any class makes up less than the threshold fraction of the target, and DataCheckErrors if the number of values for any target class is below 2 * num_cv_folds.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import ClassImbalanceDataCheck
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == {
...     "errors": [{"message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",
...                 "data_check_name": "ClassImbalanceDataCheck",
...                 "level": "error",
...                 "code": "CLASS_IMBALANCE_BELOW_FOLDS",
...                 "details": {"target_values": [0]}}],
...     "warnings": [{"message": "The following labels fall below 10% of the target: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
...                   "details": {"target_values": [0]}},
...                  {"message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_SEVERE",
...                   "details": {"target_values": [0]}}],
...     "actions": []}
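The arithmetic behind the example result can be reproduced without evalml. The following is a minimal sketch, not the library's implementation: the helper name sketch_class_imbalance and its return keys are hypothetical, num_cv_folds=3 is assumed from the "= 6 instances" in the error message above, and the strict "<" comparisons are an assumption based on the word "below" in the description.

```python
import math
from collections import Counter


def sketch_class_imbalance(y, threshold=0.10, num_cv_folds=3):
    """Hypothetical helper illustrating the two checks; not evalml code."""
    # Drop missing labels (None or float NaN), mirroring "Ignores NaN values".
    labels = [
        v for v in y
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(labels)
    total = len(labels)
    min_count = 2 * num_cv_folds  # assumed minimum instances per class

    # Classes with too few instances for cross-validation (error condition).
    below_folds = sorted(k for k, c in counts.items() if c < min_count)
    # Classes whose share of the target is under the threshold (warning condition).
    below_threshold = sorted(k for k, c in counts.items() if c / total < threshold)
    return {"below_folds": below_folds, "below_threshold": below_threshold}


result = sketch_class_imbalance([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
print(result)
```

With the target from the example, label 0 appears once out of 11 values (about 9.1%), so it falls under both the 10% threshold and the assumed minimum of 6 instances, while label 1 passes both checks.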