class_imbalance_data_check¶

Module Contents¶

Classes Summary¶

ClassImbalanceDataCheck

Check if any of the target labels are imbalanced, or if the number of values for each target are below 2 times the number of CV folds. Use for classification problems.

Contents¶

class evalml.data_checks.class_imbalance_data_check.ClassImbalanceDataCheck(threshold=0.1, min_samples=100, num_cv_folds=3)[source]¶

Check if any of the target labels are imbalanced, or if the number of values for each target are below 2 times the number of CV folds. Use for classification problems.

Parameters

threshold (float) – The minimum threshold allowed for class imbalance before a warning is raised. This threshold is calculated by comparing the number of samples in each class to the sum of samples in that class and the majority class. For example, a multiclass case with [900, 900, 100] samples per classes 0, 1, and 2, respectively, would have a 0.10 threshold for class 2 (100 / (900 + 100)). Defaults to 0.10.
min_samples (int) – The minimum number of samples per accepted class. If the minority class is both below the threshold and min_samples, then we consider this severely imbalanced. Must be greater than 0. Defaults to 100.
num_cv_folds (int) – The number of cross-validation folds. Must be positive. Choose 0 to ignore this warning. Defaults to 3.

Methods

`name`	Returns a name describing the data check.
`validate`	Checks if any target labels are imbalanced beyond a threshold for binary and multiclass problems

name(cls)¶: Returns a name describing the data check.

validate(self, X, y)[source]¶

Checks if any target labels are imbalanced beyond a threshold for binary and multiclass problems: Ignores NaN values in target labels if they appear.

Parameters

X (pd.DataFrame, np.ndarray) – Features. Ignored.
y (pd.Series, np.ndarray) – Target labels to check for imbalanced data.

Returns

Dictionary with DataCheckWarnings if imbalance in classes is less than the threshold,: and DataCheckErrors if the number of values for each target is below 2 * num_cv_folds.

Return type

dict

Example

>>> import pandas as pd
>>> X = pd.DataFrame()
>>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
>>> assert target_check.validate(X, y) == {"errors": [{"message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",                                                                   "data_check_name": "ClassImbalanceDataCheck",                                                                   "level": "error",                                                                   "code": "CLASS_IMBALANCE_BELOW_FOLDS",                                                                   "details": {"target_values": [0]}}],                                                     "warnings": [{"message": "The following labels fall below 10% of the target: [0]",                                                                   "data_check_name": "ClassImbalanceDataCheck",                                                                   "level": "warning",                                                                   "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",                                                                   "details": {"target_values": [0]}},                                                                   {"message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]",                                                                   "data_check_name": "ClassImbalanceDataCheck",                                                                   "level": "warning",                                                                   "code": "CLASS_IMBALANCE_SEVERE",                                                                   "details": {"target_values": [0]}}],                                                     "actions": []}

Data Checks data_check