invalid_target_data_check
======================================================

.. py:module:: evalml.data_checks.invalid_target_data_check

.. autoapi-nested-parse::

   Data check that checks if the target data contains missing or invalid values.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.invalid_target_data_check.InvalidTargetDataCheck


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: InvalidTargetDataCheck(problem_type, objective, n_unique=100, null_strategy='drop')

   Check if the target data is considered invalid.

   Target data is considered invalid if:

   - Target is None.
   - Target has NaN or None values.
   - Target is of an unsupported Woodwork logical type.
   - Target and features have different lengths or indices.
   - Target does not have enough instances of a class in a classification problem.
   - Target does not contain numeric data for regression problems.

   :param problem_type: The specific problem type to run the data check for, e.g. 'binary', 'multiclass', 'regression', 'time series regression'.
   :type problem_type: str or ProblemTypes
   :param objective: Name or instance of the objective class.
   :type objective: str or ObjectiveBase
   :param n_unique: Number of unique target values to store when problem type is binary and target incorrectly has more than 2 unique values.
                    Non-negative integer. If None, stores all unique values. Defaults to 100.
   :type n_unique: int
   :param null_strategy: The type of action option that should be returned if the target is partially null.
                         The options are `impute` and `drop` (default).
                         `impute` - Will return a `DataCheckActionOption` for imputing the target column.
                         `drop` - Will return a `DataCheckActionOption` for dropping the null rows in the target column.
   :type null_strategy: str

   **Attributes**

   .. list-table::
      :widths: 15 85
      :header-rows: 0

      * - **multiclass_continuous_threshold**
        - 0.05

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.invalid_target_data_check.InvalidTargetDataCheck.name
      evalml.data_checks.invalid_target_data_check.InvalidTargetDataCheck.validate


   .. py:method:: name(cls)

      Return a name describing the data check.

   .. py:method:: validate(self, X, y)

      Check if the target data is considered invalid. If the input features argument is not None, it will be used to check that the target and features have the same dimensions and indices.

      Target data is considered invalid if:

      - Target is None.
      - Target has NaN or None values.
      - Target is of an unsupported Woodwork logical type.
      - Target and features have different lengths or indices.
      - Target does not have enough instances of a class in a classification problem.
      - Target does not contain numeric data for regression problems.

      :param X: Features. If not None, will be used to check that the target and features have the same dimensions and indices.
      :type X: pd.DataFrame, np.ndarray
      :param y: Target data to check for invalid values.
      :type y: pd.Series, np.ndarray

      :returns: List of dicts, one per DataCheckError or DataCheckWarning found in the target data.
      :rtype: list[dict]

      .. rubric:: Examples

      >>> import pandas as pd
      >>> from evalml.data_checks import InvalidTargetDataCheck

      Target values must be integers, doubles, or booleans.

      >>> X = pd.DataFrame({"col": [1, 2, 3, 1]})
      >>> y = pd.Series(["cat_1", "cat_2", "cat_1", "cat_2"])
      >>> target_check = InvalidTargetDataCheck("regression", "R2", null_strategy="impute")
      >>> assert target_check.validate(X, y) == [
      ...     {
      ...         "message": "Target is unsupported Unknown type. Valid Woodwork logical types include: integer, double, boolean, age, age_fractional, integer_nullable, boolean_nullable, age_nullable",
      ...         "data_check_name": "InvalidTargetDataCheck",
      ...         "level": "error",
      ...         "details": {"columns": None, "rows": None, "unsupported_type": "unknown"},
      ...         "code": "TARGET_UNSUPPORTED_TYPE",
      ...         "action_options": []
      ...     },
      ...     {
      ...         "message": "Target data type should be numeric for regression type problems.",
      ...         "data_check_name": "InvalidTargetDataCheck",
      ...         "level": "error",
      ...         "details": {"columns": None, "rows": None},
      ...         "code": "TARGET_UNSUPPORTED_TYPE_REGRESSION",
      ...         "action_options": []
      ...     }
      ... ]

      The target cannot have null values. A fully null target returns a single error with no action options.

      >>> y = pd.Series([None, pd.NA, pd.NaT, None])
      >>> assert target_check.validate(X, y) == [
      ...     {
      ...         "message": "Target is either empty or fully null.",
      ...         "data_check_name": "InvalidTargetDataCheck",
      ...         "level": "error",
      ...         "details": {"columns": None, "rows": None},
      ...         "code": "TARGET_IS_EMPTY_OR_FULLY_NULL",
      ...         "action_options": []
      ...     }
      ... ]

      A partially null target returns an error whose action option follows ``null_strategy`` (here, ``"impute"``).

      >>> y = pd.Series([1, None, 3, None])
      >>> assert target_check.validate(None, y) == [
      ...     {
      ...         "message": "2 row(s) (50.0%) of target values are null",
      ...         "data_check_name": "InvalidTargetDataCheck",
      ...         "level": "error",
      ...         "details": {
      ...             "columns": None,
      ...             "rows": [1, 3],
      ...             "num_null_rows": 2,
      ...             "pct_null_rows": 50.0
      ...         },
      ...         "code": "TARGET_HAS_NULL",
      ...         "action_options": [
      ...             {
      ...                 "code": "IMPUTE_COL",
      ...                 "data_check_name": "InvalidTargetDataCheck",
      ...                 "parameters": {
      ...                     "impute_strategy": {
      ...                         "parameter_type": "global",
      ...                         "type": "category",
      ...                         "categories": ["mean", "most_frequent"],
      ...                         "default_value": "mean"
      ...                     }
      ...                 },
      ...                 "metadata": {"columns": None, "rows": None, "is_target": True},
      ...             }
      ...         ],
      ...     }
      ... ]

      If the target values don't match the problem type passed, an error is returned. In this instance, only two values exist in the target column, but multiclass has been passed as the problem type.

      >>> X = pd.DataFrame([i for i in range(50)])
      >>> y = pd.Series([i%2 for i in range(50)])
      >>> target_check = InvalidTargetDataCheck("multiclass", "Log Loss Multiclass")
      >>> assert target_check.validate(X, y) == [
      ...     {
      ...         "message": "Target has two or less classes, which is too few for multiclass problems. Consider changing to binary.",
      ...         "data_check_name": "InvalidTargetDataCheck",
      ...         "level": "error",
      ...         "details": {"columns": None, "rows": None, "num_classes": 2},
      ...         "code": "TARGET_MULTICLASS_NOT_ENOUGH_CLASSES",
      ...         "action_options": []
      ...     }
      ... ]

      If the lengths of X and y differ, a warning is returned. A warning is also returned for indices that don't match.

      >>> target_check = InvalidTargetDataCheck("regression", "R2")
      >>> X = pd.DataFrame([i for i in range(5)])
      >>> y = pd.Series([1, 2, 4, 3], index=[1, 2, 4, 3])
      >>> assert target_check.validate(X, y) == [
      ...     {
      ...         "message": "Input target and features have different lengths",
      ...         "data_check_name": "InvalidTargetDataCheck",
      ...         "level": "warning",
      ...         "details": {"columns": None, "rows": None, "features_length": 5, "target_length": 4},
      ...         "code": "MISMATCHED_LENGTHS",
      ...         "action_options": []
      ...     },
      ...     {
      ...         "message": "Input target and features have mismatched indices. Details will include the first 10 mismatched indices.",
      ...         "data_check_name": "InvalidTargetDataCheck",
      ...         "level": "warning",
      ...         "details": {
      ...             "columns": None,
      ...             "rows": None,
      ...             "indices_not_in_features": [],
      ...             "indices_not_in_target": [0]
      ...         },
      ...         "code": "MISMATCHED_INDICES",
      ...         "action_options": []
      ...     }
      ... ]