evalml.data_checks.ClassImbalanceDataCheck.validate

ClassImbalanceDataCheck.validate(X, y)

Checks whether any target labels are imbalanced beyond a threshold, for binary and multiclass problems. NaN values in the target labels are ignored if they appear.

Parameters

X (ww.DataTable, pd.DataFrame, np.ndarray) – Features. Ignored.
y (ww.DataColumn, pd.Series, np.ndarray) – Target labels to check for imbalanced data.

Returns

Dictionary with DataCheckWarnings if the proportion of any class falls below the threshold, and DataCheckErrors if the number of instances of any target value is below 2 * num_cv_folds.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import ClassImbalanceDataCheck
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == {
...     "errors": [{"message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",
...                 "data_check_name": "ClassImbalanceDataCheck",
...                 "level": "error",
...                 "code": "CLASS_IMBALANCE_BELOW_FOLDS",
...                 "details": {"target_values": [0]}}],
...     "warnings": [{"message": "The following labels fall below 10% of the target: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
...                   "details": {"target_values": [0]}},
...                  {"message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]",
...                   "data_check_name": "ClassImbalanceDataCheck",
...                   "level": "warning",
...                   "code": "CLASS_IMBALANCE_SEVERE",
...                   "details": {"target_values": [0]}}],
...     "actions": []}
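To see why label 0 triggers all three messages in the example, the counts can be worked through by hand. The following is a minimal sketch using only the standard library, not evalml's implementation. The function name `class_imbalance_summary` is hypothetical, and the defaults `num_cv_folds=3` and `min_samples=100` are assumptions inferred from the "6 instances" and "less than 100 samples" wording in the messages.

```python
import math
from collections import Counter


def class_imbalance_summary(y, threshold=0.10, num_cv_folds=3, min_samples=100):
    """Illustrative sketch only; evalml's actual logic may differ in detail."""
    # Drop missing labels (None or float NaN), since the check ignores NaN targets.
    labels = [
        v for v in y
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {"below_folds": [], "below_threshold": [], "severe": []}

    # Error condition: fewer than 2 * num_cv_folds instances of a label.
    below_folds = [k for k, c in counts.items() if c < 2 * num_cv_folds]
    # Warning condition: the label's share of the target is under the threshold.
    below_threshold = [k for k, c in counts.items() if c / total < threshold]
    # Severe warning (assumed rule): under the threshold and under min_samples.
    severe = [k for k in below_threshold if counts[k] < min_samples]
    return {
        "below_folds": below_folds,
        "below_threshold": below_threshold,
        "severe": severe,
    }


# Same target as the example: one 0 and ten 1s.
summary = class_imbalance_summary([0] + [1] * 10)
print(summary)
# {'below_folds': [0], 'below_threshold': [0], 'severe': [0]}
```

Label 0 appears once out of 11 values (about 9.1%), which is under the 10% threshold, under 6 instances, and under 100 samples, so it is reported in the error and in both warnings.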
