class_imbalance_data_check
=======================================================

.. py:module:: evalml.data_checks.class_imbalance_data_check

.. autoapi-nested-parse::

   Data check that checks whether any target labels are imbalanced, or whether the number of values for any target is below 2 times the number of CV folds.

   Use for classification problems.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.class_imbalance_data_check.ClassImbalanceDataCheck


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: ClassImbalanceDataCheck(threshold=0.1, min_samples=100, num_cv_folds=3, test_size=None)

   Check whether any target labels are imbalanced, or whether the number of values for any target is below 2 times the number of CV folds. Use for classification problems.

   :param threshold: The minimum ratio allowed for a class before an imbalance warning is raised. The ratio for each class is its number of samples divided by the sum of its samples and the samples in the majority class. For example, a multiclass case with [900, 900, 100] samples for classes 0, 1, and 2, respectively, gives class 2 a ratio of 0.10 (100 / (900 + 100)). Defaults to 0.10.
   :type threshold: float
   :param min_samples: The minimum number of samples per accepted class. If the minority class is below both the threshold and min_samples, it is considered severely imbalanced. Must be greater than 0. Defaults to 100.
   :type min_samples: int
   :param num_cv_folds: The number of cross-validation folds. Must be non-negative. Set to 0 to skip the check on the number of values per target. Defaults to 3.
   :type num_cv_folds: int
   :param test_size: Proportion of the data reserved for the test set. Used to calculate class imbalance prior to splitting the data into training and validation/test sets. Defaults to None.
   :type test_size: None, float, int
   :raises ValueError: If threshold is not between 0 and 0.5.
   :raises ValueError: If min_samples is not greater than 0.
   :raises ValueError: If num_cv_folds is negative.
   :raises ValueError: If test_size is not between 0 and 1.

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.class_imbalance_data_check.ClassImbalanceDataCheck.name
      evalml.data_checks.class_imbalance_data_check.ClassImbalanceDataCheck.validate


   .. py:method:: name(cls)

      Return a name describing the data check.


   .. py:method:: validate(self, X, y)

      Check if any target labels are imbalanced beyond a threshold for binary and multiclass problems.

      Ignores NaN values in target labels if they appear.

      :param X: Features. Ignored.
      :type X: pd.DataFrame, np.ndarray
      :param y: Target labels to check for imbalanced data.
      :type y: pd.Series, np.ndarray

      :returns: List of DataCheckWarnings for any class whose imbalance ratio is less than the threshold, and DataCheckErrors if the number of values for any target is below 2 * num_cv_folds.
      :rtype: list

      .. rubric:: Examples

      >>> import pandas as pd
      >>> from evalml.data_checks.class_imbalance_data_check import ClassImbalanceDataCheck
      ...
      >>> X = pd.DataFrame()
      >>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

      In this binary example, target class 0 makes up fewer than 10% of instances (threshold=0.10) and has fewer instances than 2 * the number of cross-validation folds (2 * 3 = 6). Therefore, both a warning and an error are returned by the Class Imbalance Data Check. In addition, because class 0 is under the threshold and has fewer than `min_samples` occurrences (default is 100), a severe class imbalance warning is also returned.

      >>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.10)
      >>> assert class_imb_dc.validate(X, y) == [
      ...     {
      ...         "message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]",
      ...         "data_check_name": "ClassImbalanceDataCheck",
      ...         "level": "error",
      ...         "code": "CLASS_IMBALANCE_BELOW_FOLDS",
      ...         "details": {"target_values": [0], "rows": None, "columns": None},
      ...         "action_options": []
      ...     },
      ...     {
      ...         "message": "The following labels fall below 10% of the target: [0]",
      ...         "data_check_name": "ClassImbalanceDataCheck",
      ...         "level": "warning",
      ...         "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
      ...         "details": {"target_values": [0], "rows": None, "columns": None},
      ...         "action_options": []
      ...     },
      ...     {
      ...         "message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]",
      ...         "data_check_name": "ClassImbalanceDataCheck",
      ...         "level": "warning",
      ...         "code": "CLASS_IMBALANCE_SEVERE",
      ...         "details": {"target_values": [0], "rows": None, "columns": None},
      ...         "action_options": []
      ...     }
      ... ]

      In this multiclass example, target class 0 is present in fewer than 30% of observations; however, with 1 CV fold, the minimum number of instances required is 2 * 1 = 2. Therefore, warnings are returned, but not an error.

      >>> y = pd.Series([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])
      >>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.30, min_samples=5, num_cv_folds=1)
      >>> assert class_imb_dc.validate(X, y) == [
      ...     {
      ...         "message": "The following labels fall below 30% of the target: [0]",
      ...         "data_check_name": "ClassImbalanceDataCheck",
      ...         "level": "warning",
      ...         "code": "CLASS_IMBALANCE_BELOW_THRESHOLD",
      ...         "details": {"target_values": [0], "rows": None, "columns": None},
      ...         "action_options": []
      ...     },
      ...     {
      ...         "message": "The following labels in the target have severe class imbalance because they fall under 30% of the target and have less than 5 samples: [0]",
      ...         "data_check_name": "ClassImbalanceDataCheck",
      ...         "level": "warning",
      ...         "code": "CLASS_IMBALANCE_SEVERE",
      ...         "details": {"target_values": [0], "rows": None, "columns": None},
      ...         "action_options": []
      ...     }
      ... ]
      ...
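
      In this final example, class 0 has 2 instances and the majority classes have 4 each, so its ratio is 2 / (2 + 4), roughly 0.33, which is not below the 0.30 threshold. With 1 CV fold, the 2 instances also meet the minimum of 2 * 1 = 2, so nothing is returned.

      >>> y = pd.Series([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
      >>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.30, num_cv_folds=1)
      >>> assert class_imb_dc.validate(X, y) == []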
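
The three rules described for ``ClassImbalanceDataCheck`` (the fold minimum, the ratio threshold, and the severe-imbalance condition) can be approximated in plain Python. The following is a minimal sketch under stated assumptions, not evalml's implementation: ``imbalance_report`` and the keys of its return value are hypothetical names chosen for illustration, and the sketch ignores ``test_size``.

```python
from collections import Counter


def imbalance_report(y, threshold=0.10, min_samples=100, num_cv_folds=3):
    """Hypothetical helper that applies the documented rules to a list of labels.

    Illustration only: this is not evalml's implementation and ignores test_size.
    """
    # Drop missing labels (None or NaN); NaN is the only value not equal to itself.
    counts = Counter(v for v in y if v is not None and v == v)
    if not counts:
        return {"below_folds": [], "below_threshold": [], "severe": []}

    majority = max(counts.values())

    # Error rule: a label has fewer than 2 * num_cv_folds instances.
    below_folds = sorted(c for c, n in counts.items() if n < 2 * num_cv_folds)

    # Warning rule: n / (n + majority count) falls below the threshold.
    below_threshold = sorted(
        c for c, n in counts.items() if n / (n + majority) < threshold
    )

    # Severe rule: below the threshold and fewer than min_samples instances.
    severe = [c for c in below_threshold if counts[c] < min_samples]

    return {
        "below_folds": below_folds,
        "below_threshold": below_threshold,
        "severe": severe,
    }


# Binary example from the docstring: one instance of class 0, ten of class 1.
report = imbalance_report([0] + [1] * 10)
print(report)
```

With the default arguments, class 0 appears under all three keys, which corresponds to the one error and two warnings returned in the binary example.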