Data Checks ============================ .. py:module:: evalml.data_checks .. autoapi-nested-parse:: Data checks. Submodules ---------- .. toctree:: :titlesonly: :maxdepth: 1 class_imbalance_data_check/index.rst data_check/index.rst data_check_action/index.rst data_check_action_code/index.rst data_check_action_option/index.rst data_check_message/index.rst data_check_message_code/index.rst data_check_message_type/index.rst data_checks/index.rst datetime_format_data_check/index.rst default_data_checks/index.rst id_columns_data_check/index.rst invalid_target_data_check/index.rst mismatched_series_length_data_check/index.rst multicollinearity_data_check/index.rst no_variance_data_check/index.rst null_data_check/index.rst outliers_data_check/index.rst sparsity_data_check/index.rst target_distribution_data_check/index.rst target_leakage_data_check/index.rst ts_parameters_data_check/index.rst ts_splitting_data_check/index.rst uniqueness_data_check/index.rst utils/index.rst Package Contents ---------------- Classes Summary ~~~~~~~~~~~~~~~ .. 
autoapisummary:: evalml.data_checks.ClassImbalanceDataCheck evalml.data_checks.DataCheck evalml.data_checks.DataCheckAction evalml.data_checks.DataCheckActionCode evalml.data_checks.DataCheckActionOption evalml.data_checks.DataCheckError evalml.data_checks.DataCheckMessage evalml.data_checks.DataCheckMessageCode evalml.data_checks.DataCheckMessageType evalml.data_checks.DataChecks evalml.data_checks.DataCheckWarning evalml.data_checks.DateTimeFormatDataCheck evalml.data_checks.DCAOParameterAllowedValuesType evalml.data_checks.DCAOParameterType evalml.data_checks.DefaultDataChecks evalml.data_checks.IDColumnsDataCheck evalml.data_checks.InvalidTargetDataCheck evalml.data_checks.MismatchedSeriesLengthDataCheck evalml.data_checks.MulticollinearityDataCheck evalml.data_checks.NoVarianceDataCheck evalml.data_checks.NullDataCheck evalml.data_checks.OutliersDataCheck evalml.data_checks.SparsityDataCheck evalml.data_checks.TargetDistributionDataCheck evalml.data_checks.TargetLeakageDataCheck evalml.data_checks.TimeSeriesParametersDataCheck evalml.data_checks.TimeSeriesSplittingDataCheck evalml.data_checks.UniquenessDataCheck Contents ~~~~~~~~~~~~~~~~~~~ .. py:class:: ClassImbalanceDataCheck(threshold=0.1, min_samples=100, num_cv_folds=3, test_size=None) Check if any of the target labels are imbalanced, or if the number of values for each target is below 2 times the number of CV folds. Use for classification problems. :param threshold: The minimum threshold allowed for class imbalance before a warning is raised. This threshold is calculated by comparing the number of samples in each class to the sum of samples in that class and the majority class. For example, in a multiclass case with [900, 900, 100] samples for classes 0, 1, and 2, respectively, class 2 would have a ratio of 0.10 (100 / (900 + 100)). Defaults to 0.10. :type threshold: float :param min_samples: The minimum number of samples per accepted class.
If the minority class falls below both the threshold and min_samples, the imbalance is considered severe. Must be greater than 0. Defaults to 100. :type min_samples: int :param num_cv_folds: The number of cross-validation folds. Must be non-negative. Set to 0 to skip the check on the number of folds. Defaults to 3. :type num_cv_folds: int :param test_size: Percentage of test set size. Used to calculate class imbalance prior to splitting the data into training and validation/test sets. :type test_size: None, float, int :raises ValueError: If threshold is not between 0 and 0.5. :raises ValueError: If min_samples is not greater than 0. :raises ValueError: If the number of CV folds is negative. :raises ValueError: If test_size is not between 0 and 1. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.ClassImbalanceDataCheck.name evalml.data_checks.ClassImbalanceDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y) Check if any target labels are imbalanced beyond a threshold for binary and multiclass problems. Ignores NaN values in target labels if they appear. :param X: Features. Ignored. :type X: pd.DataFrame, np.ndarray :param y: Target labels to check for imbalanced data. :type y: pd.Series, np.ndarray :returns: List of dictionaries with DataCheckWarnings if imbalance in classes is less than the threshold, and DataCheckErrors if the number of values for each target is below 2 * num_cv_folds. :rtype: list .. rubric:: Examples >>> import pandas as pd ... >>> X = pd.DataFrame() >>> y = pd.Series([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) In this binary example, the target class 0 is present in fewer than 10% (threshold=0.10) of instances, and fewer than 2 * the number of cross-validation folds (2 * 3 = 6). Therefore, both a warning and an error are returned as part of the Class Imbalance Data Check.
In addition, if a target is present with fewer than `min_samples` occurrences (default is 100) and is under the threshold, a severe class imbalance warning will be raised. >>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.10) >>> assert class_imb_dc.validate(X, y) == [ ... { ... "message": "The number of instances of these targets is less than 2 * the number of cross folds = 6 instances: [0]", ... "data_check_name": "ClassImbalanceDataCheck", ... "level": "error", ... "code": "CLASS_IMBALANCE_BELOW_FOLDS", ... "details": {"target_values": [0], "rows": None, "columns": None}, ... "action_options": [] ... }, ... { ... "message": "The following labels fall below 10% of the target: [0]", ... "data_check_name": "ClassImbalanceDataCheck", ... "level": "warning", ... "code": "CLASS_IMBALANCE_BELOW_THRESHOLD", ... "details": {"target_values": [0], "rows": None, "columns": None}, ... "action_options": [] ... }, ... { ... "message": "The following labels in the target have severe class imbalance because they fall under 10% of the target and have less than 100 samples: [0]", ... "data_check_name": "ClassImbalanceDataCheck", ... "level": "warning", ... "code": "CLASS_IMBALANCE_SEVERE", ... "details": {"target_values": [0], "rows": None, "columns": None}, ... "action_options": [] ... } ... ] In this multiclass example, the target class 0 is present in fewer than 30% of observations; however, with 1 CV fold, the minimum number of instances required is 2 * 1 = 2. Therefore a warning, but not an error, is raised. >>> y = pd.Series([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2]) >>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.30, min_samples=5, num_cv_folds=1) >>> assert class_imb_dc.validate(X, y) == [ ... { ... "message": "The following labels fall below 30% of the target: [0]", ... "data_check_name": "ClassImbalanceDataCheck", ... "level": "warning", ... "code": "CLASS_IMBALANCE_BELOW_THRESHOLD", ... "details": {"target_values": [0], "rows": None, "columns": None}, ...
"action_options": [] ... }, ... { ... "message": "The following labels in the target have severe class imbalance because they fall under 30% of the target and have less than 5 samples: [0]", ... "data_check_name": "ClassImbalanceDataCheck", ... "level": "warning", ... "code": "CLASS_IMBALANCE_SEVERE", ... "details": {"target_values": [0], "rows": None, "columns": None}, ... "action_options": [] ... } ... ] ... With one fewer sample in class 2, class 0 no longer falls below the 30% threshold (2 / (2 + 4) is about 0.33), so no messages are returned. >>> y = pd.Series([0, 0, 1, 1, 1, 1, 2, 2, 2, 2]) >>> class_imb_dc = ClassImbalanceDataCheck(threshold=0.30, num_cv_folds=1) >>> assert class_imb_dc.validate(X, y) == [] .. py:class:: DataCheck Base class for all data checks. Data checks are a set of heuristics used to determine if there are problems with input data. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheck.name evalml.data_checks.DataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y=None) :abstractmethod: Inspect and validate the input data, run any necessary calculations or algorithms, and return a list of warnings and errors if applicable. :param X: The input data of shape [n_samples, n_features]. :type X: pd.DataFrame :param y: The target data of length [n_samples]. :type y: pd.Series, optional :returns: List of DataCheckError and DataCheckWarning messages. :rtype: list (DataCheckMessage) .. py:class:: DataCheckAction(action_code, data_check_name, metadata=None) A recommended action returned by a DataCheck. :param action_code: Action code associated with the action. :type action_code: str, DataCheckActionCode :param data_check_name: Name of the data check. :type data_check_name: str :param metadata: Additional useful information associated with the action. Defaults to None. :type metadata: dict, optional **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckAction.convert_dict_to_action evalml.data_checks.DataCheckAction.to_dict ..
py:method:: convert_dict_to_action(action_dict) :staticmethod: Convert a dictionary into a DataCheckAction. :param action_dict: Dictionary to convert into an action. Should have keys "code", "data_check_name", and "metadata". :raises ValueError: If the input dictionary does not have keys `code` and `metadata`, or if the `metadata` dictionary does not have keys `columns` and `rows`. :returns: DataCheckAction object from the input dictionary. .. py:method:: to_dict(self) Return a dictionary form of the data check action. .. py:class:: DataCheckActionCode Enum for data check action code. **Attributes** .. list-table:: :widths: 15 85 :header-rows: 0 * - **DROP_COL** - Action code for dropping a column. * - **DROP_ROWS** - Action code for dropping rows. * - **IMPUTE_COL** - Action code for imputing a column. * - **REGULARIZE_AND_IMPUTE_DATASET** - Action code for regularizing and imputing all features and target time series data. * - **SET_FIRST_COL_ID** - Action code for setting the first column as an ID column. * - **TRANSFORM_TARGET** - Action code for transforming the target data. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckActionCode.name evalml.data_checks.DataCheckActionCode.value .. py:method:: name(self) The name of the Enum member. .. py:method:: value(self) The value of the Enum member. .. py:class:: DataCheckActionOption(action_code, data_check_name, parameters=None, metadata=None) A recommended action option returned by a DataCheck. It contains an action code that indicates what the action should be, a data check name that indicates which data check generated the action, and parameters and metadata that can be used to further refine the action. :param action_code: Action code associated with the action option. :type action_code: DataCheckActionCode :param data_check_name: Name of the data check that produced this option. :type data_check_name: str :param parameters: Parameters associated with the action option.
Defaults to None. :type parameters: dict, optional :param metadata: Additional useful information associated with the action option. Defaults to None. :type metadata: dict, optional .. rubric:: Examples >>> parameters = { ... "global_parameter_name": { ... "parameter_type": "global", ... "type": "float", ... "default_value": 0.0, ... }, ... "column_parameter_name": { ... "parameter_type": "column", ... "columns": { ... "a": { ... "impute_strategy": { ... "categories": ["mean", "most_frequent"], ... "type": "category", ... "default_value": "mean", ... }, ... "constant_fill_value": {"type": "float", "default_value": 0}, ... }, ... }, ... }, ... } >>> data_check_action = DataCheckActionOption(DataCheckActionCode.DROP_COL, None, metadata={}, parameters=parameters) **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckActionOption.convert_dict_to_option evalml.data_checks.DataCheckActionOption.get_action_from_defaults evalml.data_checks.DataCheckActionOption.to_dict .. py:method:: convert_dict_to_option(action_dict) :staticmethod: Convert a dictionary into a DataCheckActionOption. :param action_dict: Dictionary to convert into an action option. Should have keys "code", "data_check_name", and "metadata". :raises ValueError: If the input dictionary does not have keys `code` and `metadata`, or if the `metadata` dictionary does not have keys `columns` and `rows`. :returns: DataCheckActionOption object from the input dictionary. .. py:method:: get_action_from_defaults(self) Return an action based on the default parameters. :returns: An action based on the default parameters of the option. :rtype: DataCheckAction .. py:method:: to_dict(self) Return a dictionary form of the data check action option. .. py:class:: DataCheckError(message, data_check_name, message_code=None, details=None, action_options=None) DataCheckMessage subclass for errors returned by data checks. **Attributes** ..
list-table:: :widths: 15 85 :header-rows: 0 * - **message_type** - DataCheckMessageType.ERROR **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckError.to_dict .. py:method:: to_dict(self) Return a dictionary form of the data check message. .. py:class:: DataCheckMessage(message, data_check_name, message_code=None, details=None, action_options=None) Base class for a message returned by a DataCheck, tagged by name. :param message: Message string. :type message: str :param data_check_name: Name of the associated data check. :type data_check_name: str :param message_code: Message code associated with the message. Defaults to None. :type message_code: DataCheckMessageCode, optional :param details: Additional useful information associated with the message. Defaults to None. :type details: dict, optional :param action_options: A list of `DataCheckActionOption`s associated with the message. Defaults to None. :type action_options: list, optional **Attributes** .. list-table:: :widths: 15 85 :header-rows: 0 * - **message_type** - None **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckMessage.to_dict .. py:method:: to_dict(self) Return a dictionary form of the data check message. .. py:class:: DataCheckMessageCode Enum for data check message code. **Attributes** .. list-table:: :widths: 15 85 :header-rows: 0 * - **CLASS_IMBALANCE_BELOW_FOLDS** - Message code for when the number of values for each target is below 2 * number of CV folds. * - **CLASS_IMBALANCE_BELOW_THRESHOLD** - Message code for when balance in classes is less than the threshold. * - **CLASS_IMBALANCE_SEVERE** - Message code for when balance in classes is less than the threshold and minimum class is less than minimum number of accepted samples. * - **COLS_WITH_NULL** - Message code for columns with null values. * - **DATETIME_HAS_MISALIGNED_VALUES** - Message code for when datetime information has values that are not aligned with the inferred frequency. 
* - **DATETIME_HAS_NAN** - Message code for when input datetime columns contain NaN values. * - **DATETIME_HAS_REDUNDANT_ROW** - Message code for when datetime information has more than one row per datetime. * - **DATETIME_HAS_UNEVEN_INTERVALS** - Message code for when the datetime values have uneven intervals. * - **DATETIME_INFORMATION_NOT_FOUND** - Message code for when datetime information cannot be found or is in an unsupported format. * - **DATETIME_IS_MISSING_VALUES** - Message code for when the datetime feature has values missing between the start and end dates. * - **DATETIME_IS_NOT_MONOTONIC** - Message code for when the datetime values are not monotonically increasing. * - **DATETIME_NO_FREQUENCY_INFERRED** - Message code for when no frequency can be inferred in the datetime values through Woodwork's infer_frequency. * - **HAS_ID_COLUMN** - Message code for data that has ID columns. * - **HAS_ID_FIRST_COLUMN** - Message code for data that has an ID column as the first column. * - **HAS_OUTLIERS** - Message code for when outliers are detected. * - **HIGH_VARIANCE** - Message code for when high variance is detected for cross-validation. * - **HIGHLY_NULL_COLS** - Message code for highly null columns. * - **HIGHLY_NULL_ROWS** - Message code for highly null rows. * - **INVALID_SERIES_ID_COL** - Message code for when the given series_id is invalid. * - **IS_MULTICOLLINEAR** - Message code for when data is potentially multicollinear. * - **MISMATCHED_INDICES** - Message code for when input target and features have mismatched indices. * - **MISMATCHED_INDICES_ORDER** - Message code for when input target and features have mismatched index order. The two inputs have the same index values, but in a different order. * - **MISMATCHED_LENGTHS** - Message code for when input target and features have different lengths.
* - **MISMATCHED_SERIES_LENGTH** - Message code for when one or more unique series in a multiseries dataset is of a different length than the others. * - **NATURAL_LANGUAGE_HAS_NAN** - Message code for when input natural language columns contain NaN values. * - **NO_VARIANCE** - Message code for when data has no variance (1 unique value). * - **NO_VARIANCE_WITH_NULL** - Message code for when data has one unique value and NaN values. * - **NO_VARIANCE_ZERO_UNIQUE** - Message code for when data has no variance (0 unique values). * - **NOT_UNIQUE_ENOUGH** - Message code for when data does not possess enough unique values. * - **TARGET_BINARY_NOT_TWO_UNIQUE_VALUES** - Message code for target data for a binary classification problem that does not have two unique values. * - **TARGET_HAS_NULL** - Message code for target data that has null values. * - **TARGET_INCOMPATIBLE_OBJECTIVE** - Message code for target data that has incompatible values for the specified objective. * - **TARGET_IS_EMPTY_OR_FULLY_NULL** - Message code for target data that is empty or has all null values. * - **TARGET_IS_NONE** - Message code for when target is None. * - **TARGET_LEAKAGE** - Message code for when target leakage is detected. * - **TARGET_LOGNORMAL_DISTRIBUTION** - Message code for target data with a lognormal distribution. * - **TARGET_MULTICLASS_HIGH_UNIQUE_CLASS** - Message code for target data for a multiclass classification problem that has an abnormally large number of unique classes relative to the number of target values. * - **TARGET_MULTICLASS_NOT_ENOUGH_CLASSES** - Message code for target data for a multiclass classification problem that does not have more than two unique classes. * - **TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS** - Message code for target data for a multiclass classification problem that does not have two examples per class. * - **TARGET_UNSUPPORTED_PROBLEM_TYPE** - Message code for target data that is being checked against an unsupported problem type.
* - **TARGET_UNSUPPORTED_TYPE** - Message code for target data that is of an unsupported type. * - **TARGET_UNSUPPORTED_TYPE_REGRESSION** - Message code for target data that is incompatible with regression. * - **TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT** - Message code for when the time series parameters are too large for the smallest data split. * - **TIMESERIES_TARGET_NOT_COMPATIBLE_WITH_SPLIT** - Message code for when any training and validation split of the time series target doesn't contain all classes. * - **TOO_SPARSE** - Message code for when multiclass data has values that are too sparsely populated. * - **TOO_UNIQUE** - Message code for when data possesses too many unique values. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckMessageCode.name evalml.data_checks.DataCheckMessageCode.value .. py:method:: name(self) The name of the Enum member. .. py:method:: value(self) The value of the Enum member. .. py:class:: DataCheckMessageType Enum for type of data check message: WARNING or ERROR. **Attributes** .. list-table:: :widths: 15 85 :header-rows: 0 * - **ERROR** - Error message returned by a data check. * - **WARNING** - Warning message returned by a data check. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckMessageType.name evalml.data_checks.DataCheckMessageType.value .. py:method:: name(self) The name of the Enum member. .. py:method:: value(self) The value of the Enum member. .. py:class:: DataChecks(data_checks=None, data_check_params=None) A collection of data checks. :param data_checks: List of DataCheck objects. :type data_checks: list (DataCheck) :param data_check_params: Parameters for passed DataCheck objects. :type data_check_params: dict **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataChecks.validate .. py:method:: validate(self, X, y=None) Inspect and validate the input data against the data checks and return a list of warnings and errors if applicable.
:param X: The input data of shape [n_samples, n_features]. :type X: pd.DataFrame, np.ndarray :param y: The target data of length [n_samples]. :type y: pd.Series, np.ndarray :returns: List containing DataCheckMessage objects. :rtype: list .. py:class:: DataCheckWarning(message, data_check_name, message_code=None, details=None, action_options=None) DataCheckMessage subclass for warnings returned by data checks. **Attributes** .. list-table:: :widths: 15 85 :header-rows: 0 * - **message_type** - DataCheckMessageType.WARNING **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DataCheckWarning.to_dict .. py:method:: to_dict(self) Return a dictionary form of the data check message. .. py:class:: DateTimeFormatDataCheck(datetime_column='index', nan_duplicate_threshold=0.75, series_id=None) Check if the datetime column has equally spaced intervals and is monotonically increasing in order to be supported by time series estimators. If used for a multiseries problem, it works specifically on stacked datasets. :param datetime_column: The name of the datetime column. If the datetime values are in the index, then pass "index". :type datetime_column: str, int :param nan_duplicate_threshold: The minimum fraction of values in the `datetime_column` that must be neither duplicate nor NaN. If fewer values qualify, `DATETIME_NO_FREQUENCY_INFERRED` is returned instead of `DATETIME_HAS_UNEVEN_INTERVALS`. For example, if this is set to 0.80, then only 20% of the values in `datetime_column` can be duplicate or NaN. Defaults to 0.75. :type nan_duplicate_threshold: float :param series_id: The name of the series_id column for multiseries. Defaults to None. :type series_id: str **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DateTimeFormatDataCheck.name evalml.data_checks.DateTimeFormatDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. ..
py:method:: validate(self, X, y) Check if the datetime column has equal intervals and is monotonically increasing. Returns DataCheckError(s) if the data is not a datetime type, is not increasing, has redundant or missing row(s), contains invalid (NaN or None) values, or has values that don't align with the assumed frequency. If used for a multiseries problem, it works specifically on stacked datasets. :param X: Features. :type X: pd.DataFrame, np.ndarray :param y: Target data. :type y: pd.Series, np.ndarray :returns: List with DataCheckErrors if unequal intervals are found in the datetime column. :rtype: list (DataCheckError) .. rubric:: Examples >>> import pandas as pd >>> from woodwork.statistics_utils import infer_frequency The column 'dates' has a set of two dates with daily frequency, two dates with hourly frequency, and two dates with monthly frequency. >>> X = pd.DataFrame(pd.date_range("2015-01-01", periods=2).append(pd.date_range("2015-01-08", periods=2, freq="H").append(pd.date_range("2016-03-02", periods=2, freq="M"))), columns=["dates"]) >>> y = pd.Series([0, 1, 0, 1, 1, 0]) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "No frequency could be detected in column 'dates', possibly due to uneven intervals or too many duplicate/missing values.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_NO_FREQUENCY_INFERRED", ... "details": {"columns": None, "rows": None}, ... "action_options": [] ... } ... ] The column "dates" has a gap in the values, which implies there are many dates missing. >>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-31", periods=50)), columns=["dates"]) >>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0]) >>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") >>> assert datetime_format_dc.validate(X, y) == [ ...
{ ... "message": "Column 'dates' has datetime values missing between start and end date.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_IS_MISSING_VALUES", ... "details": {"columns": None, "rows": None}, ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'dates', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ] The column "dates" has a repeat of the date 2021-01-09 appended to the end, which is considered redundant and will raise an error. >>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-09", periods=1)), columns=["dates"]) >>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0]) >>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Column 'dates' has more than one row with the same datetime value.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_REDUNDANT_ROW", ... "details": {"columns": None, "rows": None}, ... "action_options": [] ... }, ... { ... 
"message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'dates', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ] The column "Weeks" has a date that does not follow the weekly pattern, which is considered misaligned. >>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=12).append(pd.date_range("2021-03-22", periods=1)), columns=["Weeks"]) >>> ww_payload = infer_frequency(X["Weeks"], debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Column 'Weeks' has datetime values that do not align with the inferred frequency.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_HAS_MISALIGNED_VALUES", ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'Weeks', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 
'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'Weeks', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ] The column "Weeks" passed integers instead of datetime data, which will raise an error. >>> X = pd.DataFrame([1, 2, 3, 4], columns=["Weeks"]) >>> y = pd.Series([0] * 4) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Datetime information could not be found in the data, or was not in a supported datetime format.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_INFORMATION_NOT_FOUND", ... "action_options": [] ... } ... ] Converting that same integer data to datetime, however, is valid. >>> X = pd.DataFrame(pd.to_datetime([1, 2, 3, 4]), columns=["Weeks"]) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [] >>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=10), columns=["Weeks"]) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [] While the data passed in is of datetime type, time series requires the datetime information in datetime_column to be monotonically increasing (ascending). >>> X = X.iloc[::-1] >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Datetime values must be sorted in ascending order.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_IS_NOT_MONOTONIC", ... "action_options": [] ... } ... 
] The first value in the column "index" is replaced with NaT, which will raise an error in this data check. >>> dates = [["2-1-21", "3-1-21"], ... ["2-2-21", "3-2-21"], ... ["2-3-21", "3-3-21"], ... ["2-4-21", "3-4-21"], ... ["2-5-21", "3-5-21"], ... ["2-6-21", "3-6-21"], ... ["2-7-21", "3-7-21"], ... ["2-8-21", "3-8-21"], ... ["2-9-21", "3-9-21"], ... ["2-10-21", "3-10-21"], ... ["2-11-21", "3-11-21"], ... ["2-12-21", "3-12-21"]] >>> dates[0][0] = None >>> df = pd.DataFrame(dates, columns=["days", "days2"]) >>> ww_payload = infer_frequency(pd.to_datetime(df["days"]), debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="days") >>> assert datetime_format_dc.validate(df, y) == [ ... { ... "message": "Input datetime column 'days' contains NaN values. Please impute NaN values or drop these rows.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_HAS_NAN", ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'days', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'days', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... 
] For multiseries data, the datacheck goes through each series and performs checks similar to the single series case. To denote that the datacheck is checking a multiseries dataset, pass in the name of the series_id column to the datacheck. >>> X = pd.DataFrame( ... { ... "date": pd.date_range("2021-01-01", periods=15).repeat(2), ... "series_id": pd.Series(list(range(2)) * 15, dtype="str") ... } ... ) >>> X = X.drop([15]) >>> dc = DateTimeFormatDataCheck(datetime_column="date", series_id="series_id") >>> ww_payload_expected_series1 = infer_frequency((X[X["series_id"] == "1"]["date"].reset_index(drop=True)), debug=True, window_length=4, threshold=0.4) >>> assert dc.validate(X, y) == [ ... { ... "message": "Column 'date' for series '1' has datetime values missing between start and end date.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_IS_MISSING_VALUES", ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'date' for series '1', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'date', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload_expected_series1, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ] .. py:class:: DCAOParameterAllowedValuesType Enum for data check action option parameter allowed values type. **Attributes** .. 
list-table:: :widths: 15 85 :header-rows: 0 * - **CATEGORICAL** - Categorical allowed values type. Parameters that have a set of allowed values. * - **NUMERICAL** - Numerical allowed values type. Parameters that have a range of allowed values. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DCAOParameterAllowedValuesType.name evalml.data_checks.DCAOParameterAllowedValuesType.value .. py:method:: name(self) The name of the Enum member. .. py:method:: value(self) The value of the Enum member. .. py:class:: DCAOParameterType Enum for data check action option parameter type. **Attributes** .. list-table:: :widths: 15 85 :header-rows: 0 * - **COLUMN** - Column parameter type. Parameters that apply to a specific column in the data set. * - **GLOBAL** - Global parameter type. Parameters that apply to the entire data set. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DCAOParameterType.all_parameter_types evalml.data_checks.DCAOParameterType.handle_dcao_parameter_type evalml.data_checks.DCAOParameterType.name evalml.data_checks.DCAOParameterType.value .. py:method:: all_parameter_types(cls) Get a list of all defined parameter types. :returns: List of all defined parameter types. :rtype: list(DCAOParameterType) .. py:method:: handle_dcao_parameter_type(dcao_parameter_type) :staticmethod: Handles the data check action option parameter type by either returning the DCAOParameterType enum or converting from a str. :param dcao_parameter_type: Data check action option parameter type that needs to be handled. :type dcao_parameter_type: str or DCAOParameterType :returns: DCAOParameterType enum :raises KeyError: If input is not a valid DCAOParameterType enum value. :raises ValueError: If input is not a string or DCAOParameterType object. .. py:method:: name(self) The name of the Enum member. .. py:method:: value(self) The value of the Enum member. .. 
py:class:: DefaultDataChecks(problem_type, objective, n_splits=3, problem_configuration=None) A collection of basic data checks that are used by AutoML by default. Includes: - `NullDataCheck` - `HighlyNullRowsDataCheck` - `IDColumnsDataCheck` - `TargetLeakageDataCheck` - `InvalidTargetDataCheck` - `NoVarianceDataCheck` - `ClassImbalanceDataCheck` (for classification problem types) - `TargetDistributionDataCheck` (for regression problem types) - `DateTimeFormatDataCheck` (for time series problem types) - `TimeSeriesParametersDataCheck` (for time series problem types) - `TimeSeriesSplittingDataCheck` (for time series classification problem types) :param problem_type: The problem type that is being validated. Can be regression, binary, or multiclass. :type problem_type: str :param objective: Name or instance of the objective class. :type objective: str or ObjectiveBase :param n_splits: The number of splits as determined by the data splitter being used. Defaults to 3. :type n_splits: int :param problem_configuration: Required for time series problem types. Values should be passed in for time_index, gap, forecast_horizon, and max_delay. :type problem_configuration: dict **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.DefaultDataChecks.validate .. py:method:: validate(self, X, y=None) Inspect and validate the input data against data checks and return a list of warnings and errors if applicable. :param X: The input data of shape [n_samples, n_features] :type X: pd.DataFrame, np.ndarray :param y: The target data of length [n_samples] :type y: pd.Series, np.ndarray :returns: Dictionary containing DataCheckMessage objects :rtype: dict .. py:class:: IDColumnsDataCheck(id_threshold=1.0, exclude_time_index=True) Check if any of the features are likely to be ID columns. :param id_threshold: The probability threshold to be considered an ID column. Defaults to 1.0. 
:type id_threshold: float :param exclude_time_index: If True, the column set as the time index will not be included in the data check. Defaults to True. :type exclude_time_index: bool **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.IDColumnsDataCheck.name evalml.data_checks.IDColumnsDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y=None) Check if any of the features are likely to be ID columns. Currently performs a number of simple checks. Checks performed are: - column name is "id" - column name ends in "_id" - column contains all unique values (and is categorical / integer type) :param X: The input features to check. :type X: pd.DataFrame, np.ndarray :param y: The target. Defaults to None. Ignored. :type y: pd.Series :returns: A dictionary of features with column name or index and their probability of being ID columns :rtype: dict .. rubric:: Examples >>> import pandas as pd Columns that end in "_id" and are completely unique are likely to be ID columns. >>> df = pd.DataFrame({ ... "profits": [25, 15, 15, 31, 19], ... "customer_id": [123, 124, 125, 126, 127], ... "Sales": [10, 42, 31, 51, 61] ... }) ... >>> id_col_check = IDColumnsDataCheck() >>> assert id_col_check.validate(df) == [ ... { ... "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column", ... "data_check_name": "IDColumnsDataCheck", ... "level": "warning", ... "code": "HAS_ID_COLUMN", ... "details": {"columns": ["customer_id"], "rows": None}, ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "IDColumnsDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["customer_id"], "rows": None} ... } ... ] ... } ... ] Columns named "ID" with all unique values will also be identified as ID columns. >>> df = df.rename(columns={"customer_id": "ID"}) >>> id_col_check = IDColumnsDataCheck() >>> assert id_col_check.validate(df) == [ ... { ... 
"message": "Columns 'ID' are 100.0% or more likely to be an ID column", ... "data_check_name": "IDColumnsDataCheck", ... "level": "warning", ... "code": "HAS_ID_COLUMN", ... "details": {"columns": ["ID"], "rows": None}, ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "IDColumnsDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["ID"], "rows": None} ... } ... ] ... } ... ] Despite being all unique, "Country_Rank" will not be identified as an ID column as id_threshold is set to 1.0 by default and its name doesn't indicate that it's an ID. >>> df = pd.DataFrame({ ... "humidity": ["high", "very high", "low", "low", "high"], ... "Country_Rank": [1, 2, 3, 4, 5], ... "Sales": ["very high", "high", "high", "medium", "very low"] ... }) ... >>> id_col_check = IDColumnsDataCheck() >>> assert id_col_check.validate(df) == [] However, lowering the threshold will cause this column to be identified as an ID. >>> id_col_check = IDColumnsDataCheck(id_threshold=0.95) >>> assert id_col_check.validate(df) == [ ... { ... "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column", ... "data_check_name": "IDColumnsDataCheck", ... "level": "warning", ... "details": {"columns": ["Country_Rank"], "rows": None}, ... "code": "HAS_ID_COLUMN", ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "IDColumnsDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["Country_Rank"], "rows": None} ... } ... ] ... } ... ] If the first column of the dataframe has all unique values and is named either 'ID' or a name that ends with '_id', it is probably the primary key. The other ID columns should be dropped. >>> df = pd.DataFrame({ ... "sales_id": [0, 1, 2, 3, 4], ... "customer_id": [123, 124, 125, 126, 127], ... "Sales": [10, 42, 31, 51, 61] ... }) ... >>> id_col_check = IDColumnsDataCheck() >>> assert id_col_check.validate(df) == [ ... { ... 
"message": "The first column 'sales_id' is likely to be the primary key", ... "data_check_name": "IDColumnsDataCheck", ... "level": "warning", ... "code": "HAS_ID_FIRST_COLUMN", ... "details": {"columns": ["sales_id"], "rows": None}, ... "action_options": [ ... { ... "code": "SET_FIRST_COL_ID", ... "data_check_name": "IDColumnsDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["sales_id"], "rows": None} ... } ... ] ... }, ... { ... "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column", ... "data_check_name": "IDColumnsDataCheck", ... "level": "warning", ... "code": "HAS_ID_COLUMN", ... "details": {"columns": ["customer_id"], "rows": None}, ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "IDColumnsDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["customer_id"], "rows": None} ... } ... ] ... } ... ] .. py:class:: InvalidTargetDataCheck(problem_type, objective, n_unique=100, null_strategy='drop') Check if the target data is considered invalid. Target data is considered invalid if: - Target is None. - Target has NaN or None values. - Target is of an unsupported Woodwork logical type. - Target and features have different lengths or indices. - Target does not have enough instances of a class in a classification problem. - Target does not contain numeric data for regression problems. :param problem_type: The specific problem type to data check for. e.g. 'binary', 'multiclass', 'regression', 'time series regression' :type problem_type: str or ProblemTypes :param objective: Name or instance of the objective class. :type objective: str or ObjectiveBase :param n_unique: Number of unique target values to store when problem type is binary and target incorrectly has more than 2 unique values. Non-negative integer. If None, stores all unique values. Defaults to 100. :type n_unique: int :param null_strategy: The type of action option that should be returned if the target is partially null. 
The options are `impute` and `drop` (default). `impute` - Will return a `DataCheckActionOption` for imputing the target column. `drop` - Will return a `DataCheckActionOption` for dropping the null rows in the target column. :type null_strategy: str **Attributes** .. list-table:: :widths: 15 85 :header-rows: 0 * - **multiclass_continuous_threshold** - 0.05 **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.InvalidTargetDataCheck.name evalml.data_checks.InvalidTargetDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y) Check if the target data is considered invalid. If the input features argument is not None, it will be used to check that the target and features have the same dimensions and indices. Target data is considered invalid if: - Target is None. - Target has NaN or None values. - Target is of an unsupported Woodwork logical type. - Target and features have different lengths or indices. - Target does not have enough instances of a class in a classification problem. - Target does not contain numeric data for regression problems. :param X: Features. If not None, will be used to check that the target and features have the same dimensions and indices. :type X: pd.DataFrame, np.ndarray :param y: Target data to check for invalid values. :type y: pd.Series, np.ndarray :returns: List with DataCheckErrors if any invalid values are found in the target data. :rtype: dict (DataCheckError) .. rubric:: Examples >>> import pandas as pd Target values must be integers, doubles, or booleans. >>> X = pd.DataFrame({"col": [1, 2, 3, 1]}) >>> y = pd.Series(["cat_1", "cat_2", "cat_1", "cat_2"]) >>> target_check = InvalidTargetDataCheck("regression", "R2", null_strategy="impute") >>> assert target_check.validate(X, y) == [ ... { ... "message": "Target is unsupported Unknown type. 
Valid Woodwork logical types include: integer, double, boolean, age, age_fractional, integer_nullable, boolean_nullable, age_nullable", ... "data_check_name": "InvalidTargetDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None, "unsupported_type": "unknown"}, ... "code": "TARGET_UNSUPPORTED_TYPE", ... "action_options": [] ... }, ... { ... "message": "Target data type should be numeric for regression type problems.", ... "data_check_name": "InvalidTargetDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "TARGET_UNSUPPORTED_TYPE_REGRESSION", ... "action_options": [] ... } ... ] The target cannot have null values. >>> y = pd.Series([None, pd.NA, pd.NaT, None]) >>> assert target_check.validate(X, y) == [ ... { ... "message": "Target is either empty or fully null.", ... "data_check_name": "InvalidTargetDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "TARGET_IS_EMPTY_OR_FULLY_NULL", ... "action_options": [] ... } ... ] ... ... >>> y = pd.Series([1, None, 3, None]) >>> assert target_check.validate(None, y) == [ ... { ... "message": "2 row(s) (50.0%) of target values are null", ... "data_check_name": "InvalidTargetDataCheck", ... "level": "error", ... "details": { ... "columns": None, ... "rows": [1, 3], ... "num_null_rows": 2, ... "pct_null_rows": 50.0 ... }, ... "code": "TARGET_HAS_NULL", ... "action_options": [ ... { ... "code": "IMPUTE_COL", ... "data_check_name": "InvalidTargetDataCheck", ... "parameters": { ... "impute_strategy": { ... "parameter_type": "global", ... "type": "category", ... "categories": ["mean", "most_frequent"], ... "default_value": "mean" ... } ... }, ... "metadata": {"columns": None, "rows": None, "is_target": True}, ... } ... ], ... } ... ] If the target values don't match the problem type passed, an error will be raised. 
In this instance, only two values exist in the target column, but multiclass has been passed as the problem type. >>> X = pd.DataFrame([i for i in range(50)]) >>> y = pd.Series([i%2 for i in range(50)]) >>> target_check = InvalidTargetDataCheck("multiclass", "Log Loss Multiclass") >>> assert target_check.validate(X, y) == [ ... { ... "message": "Target has two or less classes, which is too few for multiclass problems. Consider changing to binary.", ... "data_check_name": "InvalidTargetDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None, "num_classes": 2}, ... "code": "TARGET_MULTICLASS_NOT_ENOUGH_CLASSES", ... "action_options": [] ... } ... ] If the lengths of X and y differ, a warning will be raised. A warning will also be raised for indices that don't match. >>> target_check = InvalidTargetDataCheck("regression", "R2") >>> X = pd.DataFrame([i for i in range(5)]) >>> y = pd.Series([1, 2, 4, 3], index=[1, 2, 4, 3]) >>> assert target_check.validate(X, y) == [ ... { ... "message": "Input target and features have different lengths", ... "data_check_name": "InvalidTargetDataCheck", ... "level": "warning", ... "details": {"columns": None, "rows": None, "features_length": 5, "target_length": 4}, ... "code": "MISMATCHED_LENGTHS", ... "action_options": [] ... }, ... { ... "message": "Input target and features have mismatched indices. Details will include the first 10 mismatched indices.", ... "data_check_name": "InvalidTargetDataCheck", ... "level": "warning", ... "details": { ... "columns": None, ... "rows": None, ... "indices_not_in_features": [], ... "indices_not_in_target": [0] ... }, ... "code": "MISMATCHED_INDICES", ... "action_options": [] ... } ... ] .. py:class:: MismatchedSeriesLengthDataCheck(series_id) Check if one or more unique series in a multiseries dataset is of a different length than the others. Currently works specifically on stacked data. :param series_id: The name of the series_id column for the dataset. 
:type series_id: str **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.MismatchedSeriesLengthDataCheck.name evalml.data_checks.MismatchedSeriesLengthDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y=None) Check if one or more unique series in a multiseries dataset is of a different length than the others. Currently works specifically on stacked data. :param X: The input features to check. Must have a series_id column. :type X: pd.DataFrame, np.ndarray :param y: The target. Defaults to None. Ignored. :type y: pd.Series :returns: List with a DataCheckWarning if there are mismatched series lengths in the dataset, or list with a DataCheckError if the given series_id is not in the dataset :rtype: dict (DataCheckWarning, DataCheckError) .. rubric:: Examples >>> import pandas as pd For multiseries time series datasets, each series ID should ideally have the same number of datetime entries as the others. If they don't, a warning will be raised denoting which series IDs have mismatched lengths. >>> X = pd.DataFrame( ... { ... "date": pd.date_range(start="1/1/2018", periods=20).repeat(5), ... "series_id": pd.Series(list(range(5)) * 20, dtype="str"), ... "feature_a": range(100), ... "feature_b": reversed(range(100)), ... }, ... ) >>> X = X.drop(labels=0, axis=0) >>> mismatched_series_length_check = MismatchedSeriesLengthDataCheck("series_id") >>> assert mismatched_series_length_check.validate(X) == [ ... { ... "message": "Series ID ['0'] do not match the majority length of the other series, which is 20", ... "data_check_name": "MismatchedSeriesLengthDataCheck", ... "level": "warning", ... "details": { ... "columns": None, ... "rows": None, ... "series_id": ['0'], ... "majority_length": 20 ... }, ... "code": "MISMATCHED_SERIES_LENGTH", ... "action_options": [], ... } ... ] If MismatchedSeriesLengthDataCheck is passed an invalid series_id column name, then an error will be raised. 
>>> X = pd.DataFrame( ... { ... "date": pd.date_range(start="1/1/2018", periods=20).repeat(5), ... "series_id": pd.Series(list(range(5)) * 20, dtype="str"), ... "feature_a": range(100), ... "feature_b": reversed(range(100)), ... }, ... ) >>> X = X.drop(labels=0, axis=0) >>> mismatched_series_length_check = MismatchedSeriesLengthDataCheck("not_series_id") >>> assert mismatched_series_length_check.validate(X) == [ ... { ... "message": "series_id 'not_series_id' is not in the dataset.", ... "data_check_name": "MismatchedSeriesLengthDataCheck", ... "level": "error", ... "details": { ... "columns": None, ... "rows": None, ... "series_id": "not_series_id", ... }, ... "code": "INVALID_SERIES_ID_COL", ... "action_options": [], ... } ... ] If there are multiple lengths that have the same number of series (e.g. two series have length 20 and two series have length 19), this datacheck will consider the higher length to be the majority length (in that example, length 20). >>> X = pd.DataFrame( ... { ... "date": pd.date_range(start="1/1/2018", periods=20).repeat(4), ... "series_id": pd.Series(list(range(4)) * 20, dtype="str"), ... "feature_a": range(80), ... "feature_b": reversed(range(80)), ... }, ... ) >>> X = X.drop(labels=[0, 1], axis=0) >>> mismatched_series_length_check = MismatchedSeriesLengthDataCheck("series_id") >>> assert mismatched_series_length_check.validate(X) == [ ... { ... "message": "Series ID ['0', '1'] do not match the majority length of the other series, which is 20", ... "data_check_name": "MismatchedSeriesLengthDataCheck", ... "level": "warning", ... "details": { ... "columns": None, ... "rows": None, ... "series_id": ['0', '1'], ... "majority_length": 20 ... }, ... "code": "MISMATCHED_SERIES_LENGTH", ... "action_options": [], ... } ... ] .. py:class:: MulticollinearityDataCheck(threshold=0.9) Check if any set of features are likely to be multicollinear. :param threshold: The threshold to be considered. 
Defaults to 0.9. :type threshold: float **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.MulticollinearityDataCheck.name evalml.data_checks.MulticollinearityDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y=None) Check if any set of features are likely to be multicollinear. :param X: The input features to check. :type X: pd.DataFrame :param y: The target. Ignored. :type y: pd.Series :returns: dict with a DataCheckWarning if there are any potentially multicollinear columns. :rtype: dict .. rubric:: Example >>> import pandas as pd Columns in X that are highly correlated with each other will be identified using mutual information. >>> col = pd.Series([1, 0, 2, 3, 4] * 15) >>> X = pd.DataFrame({"col_1": col, "col_2": col * 3}) >>> y = pd.Series([1, 0, 0, 1, 0] * 15) ... >>> multicollinearity_check = MulticollinearityDataCheck(threshold=1.0) >>> assert multicollinearity_check.validate(X, y) == [ ... { ... "message": "Columns are likely to be correlated: [('col_1', 'col_2')]", ... "data_check_name": "MulticollinearityDataCheck", ... "level": "warning", ... "code": "IS_MULTICOLLINEAR", ... "details": {"columns": [("col_1", "col_2")], "rows": None}, ... "action_options": [] ... } ... ] .. py:class:: NoVarianceDataCheck(count_nan_as_value=False) Check if the target or any of the features have no variance. :param count_nan_as_value: If True, missing values will be counted as their own unique value. Additionally, if true, will return a DataCheckWarning instead of an error if the feature has mostly missing data and only one unique value. Defaults to False. :type count_nan_as_value: bool **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.NoVarianceDataCheck.name evalml.data_checks.NoVarianceDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. 
py:method:: validate(self, X, y=None) Check if the target or any of the features have no variance (1 unique value). :param X: The input features. :type X: pd.DataFrame, np.ndarray :param y: Optional, the target data. :type y: pd.Series, np.ndarray :returns: A dict of warnings/errors corresponding to features or target with no variance. :rtype: dict .. rubric:: Examples >>> import pandas as pd Columns or target data that have only one unique value will raise a warning. >>> X = pd.DataFrame([2, 2, 2, 2, 2, 2, 2, 2], columns=["First_Column"]) >>> y = pd.Series([1, 1, 1, 1, 1, 1, 1, 1]) ... >>> novar_dc = NoVarianceDataCheck() >>> assert novar_dc.validate(X, y) == [ ... { ... "message": "'First_Column' has 1 unique value.", ... "data_check_name": "NoVarianceDataCheck", ... "level": "warning", ... "details": {"columns": ["First_Column"], "rows": None}, ... "code": "NO_VARIANCE", ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "NoVarianceDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["First_Column"], "rows": None} ... }, ... ] ... }, ... { ... "message": "Y has 1 unique value.", ... "data_check_name": "NoVarianceDataCheck", ... "level": "warning", ... "details": {"columns": ["Y"], "rows": None}, ... "code": "NO_VARIANCE", ... "action_options": [] ... } ... ] By default, NaNs will not be counted as distinct values. In the first example, there are still two distinct values besides None. In the second, there are no distinct values as the target is entirely null. >>> X["First_Column"] = [2, 2, 2, 3, 3, 3, None, None] >>> y = pd.Series([1, 1, 1, 2, 2, 2, None, None]) >>> assert novar_dc.validate(X, y) == [] ... ... >>> y = pd.Series([None] * 7) >>> assert novar_dc.validate(X, y) == [ ... { ... "message": "Y has 0 unique values.", ... "data_check_name": "NoVarianceDataCheck", ... "level": "warning", ... "details": {"columns": ["Y"], "rows": None}, ... "code": "NO_VARIANCE_ZERO_UNIQUE", ... "action_options":[] ... } ... 
] As None is not considered a distinct value by default, there is only one unique value in X and y. >>> X["First_Column"] = [2, 2, 2, 2, None, None, None, None] >>> y = pd.Series([1, 1, 1, 1, None, None, None, None]) >>> assert novar_dc.validate(X, y) == [ ... { ... "message": "'First_Column' has 1 unique value.", ... "data_check_name": "NoVarianceDataCheck", ... "level": "warning", ... "details": {"columns": ["First_Column"], "rows": None}, ... "code": "NO_VARIANCE", ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "NoVarianceDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["First_Column"], "rows": None} ... }, ... ] ... }, ... { ... "message": "Y has 1 unique value.", ... "data_check_name": "NoVarianceDataCheck", ... "level": "warning", ... "details": {"columns": ["Y"], "rows": None}, ... "code": "NO_VARIANCE", ... "action_options": [] ... } ... ] If count_nan_as_value is set to True, then NaNs are counted as unique values. In the event that there is an adequate number of unique values only because count_nan_as_value is set to True, a warning will be raised so the user can encode these values. >>> novar_dc = NoVarianceDataCheck(count_nan_as_value=True) >>> assert novar_dc.validate(X, y) == [ ... { ... "message": "'First_Column' has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.", ... "data_check_name": "NoVarianceDataCheck", ... "level": "warning", ... "details": {"columns": ["First_Column"], "rows": None}, ... "code": "NO_VARIANCE_WITH_NULL", ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "NoVarianceDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["First_Column"], "rows": None} ... }, ... ] ... }, ... { ... "message": "Y has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.", ... "data_check_name": "NoVarianceDataCheck", ... "level": "warning", ... 
"details": {"columns": ["Y"], "rows": None}, ... "code": "NO_VARIANCE_WITH_NULL", ... "action_options": [] ... } ... ] .. py:class:: NullDataCheck(pct_null_col_threshold=0.95, pct_moderately_null_col_threshold=0.2, pct_null_row_threshold=0.95) Check if there are any highly-null numerical, boolean, categorical, natural language, and unknown columns and rows in the input. :param pct_null_col_threshold: If the percentage of NaN values in an input feature exceeds this amount, that column will be considered highly-null. Defaults to 0.95. :type pct_null_col_threshold: float :param pct_moderately_null_col_threshold: If the percentage of NaN values in an input feature exceeds this amount but is less than the percentage specified in pct_null_col_threshold, that column will be considered moderately-null. Defaults to 0.20. :type pct_moderately_null_col_threshold: float :param pct_null_row_threshold: If the percentage of NaN values in an input row exceeds this amount, that row will be considered highly-null. Defaults to 0.95. :type pct_null_row_threshold: float **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.NullDataCheck.get_null_column_information evalml.data_checks.NullDataCheck.get_null_row_information evalml.data_checks.NullDataCheck.name evalml.data_checks.NullDataCheck.validate .. py:method:: get_null_column_information(X, pct_null_col_threshold=0.0) :staticmethod: Finds columns that are considered highly null (percentage null is greater than threshold) and returns dictionary mapping column name to percentage null and dictionary mapping column name to null indices. :param X: DataFrame to check for highly null columns. :type X: pd.DataFrame :param pct_null_col_threshold: Percentage threshold for a column to be considered null. Defaults to 0.0. :type pct_null_col_threshold: float :returns: Tuple containing: dictionary mapping column name to its null percentage and dictionary mapping column name to null indices in that column. :rtype: tuple .. 
py:method:: get_null_row_information(X, pct_null_row_threshold=0.0) :staticmethod: Finds rows that are considered highly null (percentage null is greater than threshold). :param X: DataFrame to check for highly null rows. :type X: pd.DataFrame :param pct_null_row_threshold: Percentage threshold for a row to be considered null. Defaults to 0.0. :type pct_null_row_threshold: float :returns: Series containing the percentage null for each row. :rtype: pd.Series .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y=None) Check if there are any highly-null columns or rows in the input. :param X: Features. :type X: pd.DataFrame, np.ndarray :param y: Ignored. Defaults to None. :type y: pd.Series, np.ndarray :returns: dict with a DataCheckWarning if there are any highly-null columns or rows. :rtype: dict .. rubric:: Examples >>> import pandas as pd ... >>> class SeriesWrap(): ... def __init__(self, series): ... self.series = series ... ... def __eq__(self, series_2): ... return all(self.series.eq(series_2.series)) With pct_null_col_threshold set to 0.50, any column that has 50% or more of its observations set to null will be included in the warning, as well as the percentage of null values identified ("all_null": 1.0, "lots_of_null": 0.8). >>> df = pd.DataFrame({ ... "all_null": [None, pd.NA, None, None, None], ... "lots_of_null": [None, None, None, None, 5], ... "few_null": [1, 2, None, 2, 3], ... "no_null": [1, 2, 3, 4, 5] ... }) ... >>> highly_null_dc = NullDataCheck(pct_null_col_threshold=0.50) >>> assert highly_null_dc.validate(df) == [ ... { ... "message": "Column(s) 'all_null', 'lots_of_null' are 50.0% or more null", ... "data_check_name": "NullDataCheck", ... "level": "warning", ... "details": { ... "columns": ["all_null", "lots_of_null"], ... "rows": None, ... "pct_null_rows": {"all_null": 1.0, "lots_of_null": 0.8} ... }, ... "code": "HIGHLY_NULL_COLS", ... "action_options": [ ... { ... "code": "DROP_COL", ... 
"data_check_name": "NullDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["all_null", "lots_of_null"], "rows": None} ... } ... ] ... }, ... { ... "message": "Column(s) 'few_null' have between 20.0% and 50.0% null values", ... "data_check_name": "NullDataCheck", ... "level": "warning", ... "details": {"columns": ["few_null"], "rows": None}, ... "code": "COLS_WITH_NULL", ... "action_options": [ ... { ... "code": "IMPUTE_COL", ... "data_check_name": "NullDataCheck", ... "metadata": {"columns": ["few_null"], "rows": None, "is_target": False}, ... "parameters": { ... "impute_strategies": { ... "parameter_type": "column", ... "columns": { ... "few_null": { ... "impute_strategy": {"categories": ["mean", "most_frequent"], "type": "category", "default_value": "mean"} ... } ... } ... } ... } ... } ... ] ... } ... ] With pct_null_row_threshold set to 0.50, any row with 50% or more of its respective column values set to null will be included in the warning, as well as the offending rows ("rows": [0, 1, 2, 3]). Since the default value for pct_null_col_threshold is 0.95, "all_null" is also included in the warnings since the percentage of null values in that column is over 95%. Since the default value for pct_moderately_null_col_threshold is 0.20, "few_null" is included as a "moderately null" column as it has a null column percentage of 20%. >>> highly_null_dc = NullDataCheck(pct_null_row_threshold=0.50) >>> validation_messages = highly_null_dc.validate(df) >>> validation_messages[0]["details"]["pct_null_cols"] = SeriesWrap(validation_messages[0]["details"]["pct_null_cols"]) >>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.75, 0.5])) >>> assert validation_messages == [ ... { ... "message": "4 out of 5 rows are 50.0% or more null", ... "data_check_name": "NullDataCheck", ... "level": "warning", ... "details": { ... "columns": None, ... "rows": [0, 1, 2, 3], ... "pct_null_cols": highly_null_rows ... }, ... "code": "HIGHLY_NULL_ROWS", ... "action_options": [ ... { ... 
"code": "DROP_ROWS", ... "data_check_name": "NullDataCheck", ... "parameters": {}, ... "metadata": {"columns": None, "rows": [0, 1, 2, 3]} ... } ... ] ... }, ... { ... "message": "Column(s) 'all_null' are 95.0% or more null", ... "data_check_name": "NullDataCheck", ... "level": "warning", ... "details": { ... "columns": ["all_null"], ... "rows": None, ... "pct_null_rows": {"all_null": 1.0} ... }, ... "code": "HIGHLY_NULL_COLS", ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "NullDataCheck", ... "metadata": {"columns": ["all_null"], "rows": None}, ... "parameters": {} ... } ... ] ... }, ... { ... "message": "Column(s) 'lots_of_null', 'few_null' have between 20.0% and 95.0% null values", ... "data_check_name": "NullDataCheck", ... "level": "warning", ... "details": {"columns": ["lots_of_null", "few_null"], "rows": None}, ... "code": "COLS_WITH_NULL", ... "action_options": [ ... { ... "code": "IMPUTE_COL", ... "data_check_name": "NullDataCheck", ... "metadata": {"columns": ["lots_of_null", "few_null"], "rows": None, "is_target": False}, ... "parameters": { ... "impute_strategies": { ... "parameter_type": "column", ... "columns": { ... "lots_of_null": {"impute_strategy": {"categories": ["mean", "most_frequent"], "type": "category", "default_value": "mean"}}, ... "few_null": {"impute_strategy": {"categories": ["mean", "most_frequent"], "type": "category", "default_value": "mean"}} ... } ... } ... } ... } ... ] ... } ... ] .. py:class:: OutliersDataCheck Checks if there are any outliers in input data by using IQR to determine score anomalies. Columns with score anomalies are considered to contain outliers. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.OutliersDataCheck.get_boxplot_data evalml.data_checks.OutliersDataCheck.name evalml.data_checks.OutliersDataCheck.validate .. py:method:: get_boxplot_data(data_) :staticmethod: Returns box plot information for the given data. :param data_: Input data. 
:type data_: pd.Series, np.ndarray :returns: A payload of box plot statistics. :rtype: dict .. rubric:: Examples >>> import pandas as pd ... >>> df = pd.DataFrame({ ... "x": [1, 2, 3, 4, 5], ... "y": [6, 7, 8, 9, 10], ... "z": [-1, -2, -3, -1201, -4] ... }) >>> box_plot_data = OutliersDataCheck.get_boxplot_data(df["z"]) >>> box_plot_data["score"] = round(box_plot_data["score"], 2) >>> assert box_plot_data == { ... "score": 0.89, ... "pct_outliers": 0.2, ... "values": {"q1": -4.0, ... "median": -3.0, ... "q3": -2.0, ... "low_bound": -7.0, ... "high_bound": -1.0, ... "low_values": [-1201], ... "high_values": [], ... "low_indices": [3], ... "high_indices": []} ... } .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y=None) Check if there are any outliers in a dataframe by using IQR to determine column anomalies. Columns with anomalies are considered to contain outliers. :param X: Input features. :type X: pd.DataFrame, np.ndarray :param y: Ignored. Defaults to None. :type y: pd.Series, np.ndarray :returns: A dictionary with warnings if any columns have outliers. :rtype: dict .. rubric:: Examples >>> import pandas as pd The column "z" has an outlier, so a warning is added to alert the user of its location. >>> df = pd.DataFrame({ ... "x": [1, 2, 3, 4, 5], ... "y": [6, 7, 8, 9, 10], ... "z": [-1, -2, -3, -1201, -4] ... }) ... >>> outliers_check = OutliersDataCheck() >>> assert outliers_check.validate(df) == [ ... { ... "message": "Column(s) 'z' are likely to have outlier data.", ... "data_check_name": "OutliersDataCheck", ... "level": "warning", ... "code": "HAS_OUTLIERS", ... "details": {"columns": ["z"], "rows": [3], "column_indices": {"z": [3]}}, ... "action_options": [ ... { ... "code": "DROP_ROWS", ... "data_check_name": "OutliersDataCheck", ... "parameters": {}, ... "metadata": {"rows": [3], "columns": None} ... } ... ] ... } ... ] ..
py:class:: SparsityDataCheck(problem_type, threshold, unique_count_threshold=10) Check if there are any columns with sparsely populated values in the input. :param problem_type: The specific problem type to data check for. 'multiclass' and 'time series multiclass' are the only accepted problem types. :type problem_type: str or ProblemTypes :param threshold: The threshold value, expressed as a fraction of each column's unique values, below which a column is considered sparse. Should be between 0 and 1. :type threshold: float :param unique_count_threshold: The minimum number of times a unique value has to be present in a column to not be considered "sparse." Defaults to 10. :type unique_count_threshold: int **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.SparsityDataCheck.name evalml.data_checks.SparsityDataCheck.sparsity_score evalml.data_checks.SparsityDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: sparsity_score(col, count_threshold=10) :staticmethod: Calculate a sparsity score for the given value counts by calculating the percentage of unique values that exceed the count_threshold. :param col: Feature values. :type col: pd.Series :param count_threshold: The number of instances below which a value is considered sparse. Default is 10. :type count_threshold: int :returns: Sparsity score, or the percentage of the unique values that exceed count_threshold. :rtype: (float) .. py:method:: validate(self, X, y=None) Calculate what percentage of each column's unique values exceed the count threshold and compare that percentage to the sparsity threshold stored in the class instance. :param X: Features. :type X: pd.DataFrame, np.ndarray :param y: Ignored. :type y: pd.Series, np.ndarray :returns: dict with a DataCheckWarning if there are any sparse columns. :rtype: dict ..
rubric:: Examples >>> import pandas as pd For multiclass problems, if a column doesn't have enough representation from unique values, it will be considered sparse. >>> df = pd.DataFrame({ ... "sparse": [float(x) for x in range(100)], ... "not_sparse": [float(1) for x in range(100)] ... }) ... >>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=0.5, unique_count_threshold=10) >>> assert sparsity_check.validate(df) == [ ... { ... "message": "Input columns ('sparse') for multiclass problem type are too sparse.", ... "data_check_name": "SparsityDataCheck", ... "level": "warning", ... "code": "TOO_SPARSE", ... "details": { ... "columns": ["sparse"], ... "sparsity_score": {"sparse": 0.0}, ... "rows": None ... }, ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "SparsityDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["sparse"], "rows": None} ... } ... ] ... } ... ] ... >>> df["sparse"] = [float(x % 10) for x in range(100)] >>> sparsity_check = SparsityDataCheck(problem_type="multiclass", threshold=1, unique_count_threshold=5) >>> assert sparsity_check.validate(df) == [] ... >>> sparse_array = pd.Series([1, 1, 1, 2, 2, 3] * 3) >>> assert SparsityDataCheck.sparsity_score(sparse_array, count_threshold=5) == 0.6666666666666666 .. py:class:: TargetDistributionDataCheck Check if the target data contains certain distributions that may need to be transformed prior to training to improve model performance. Uses the Shapiro-Wilk test when the dataset has <= 5000 samples, otherwise uses Jarque-Bera. **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.TargetDistributionDataCheck.name evalml.data_checks.TargetDistributionDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y) Check if the target data has a certain distribution. :param X: Features. Ignored.
:type X: pd.DataFrame, np.ndarray :param y: Target data to check for underlying distributions. :type y: pd.Series, np.ndarray :returns: List with DataCheckErrors if certain distributions are found in the target data. :rtype: dict (DataCheckError) .. rubric:: Examples >>> import pandas as pd Targets that exhibit a lognormal distribution will raise a warning for the user to transform the target. >>> y = [0.946, 0.972, 1.154, 0.954, 0.969, 1.222, 1.038, 0.999, 0.973, 0.897] >>> target_check = TargetDistributionDataCheck() >>> assert target_check.validate(None, y) == [ ... { ... "message": "Target may have a lognormal distribution.", ... "data_check_name": "TargetDistributionDataCheck", ... "level": "warning", ... "code": "TARGET_LOGNORMAL_DISTRIBUTION", ... "details": {"normalization_method": "shapiro", "statistic": 0.8, "p-value": 0.045, "columns": None, "rows": None}, ... "action_options": [ ... { ... "code": "TRANSFORM_TARGET", ... "data_check_name": "TargetDistributionDataCheck", ... "parameters": {}, ... "metadata": { ... "transformation_strategy": "lognormal", ... "is_target": True, ... "columns": None, ... "rows": None ... } ... } ... ] ... } ... ] ... >>> y = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5]) >>> assert target_check.validate(None, y) == [] ... ... >>> y = pd.Series(pd.date_range("1/1/21", periods=10)) >>> assert target_check.validate(None, y) == [ ... { ... "message": "Target is unsupported datetime type. Valid Woodwork logical types include: integer, double, age, age_fractional", ... "data_check_name": "TargetDistributionDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None, "unsupported_type": "datetime"}, ... "code": "TARGET_UNSUPPORTED_TYPE", ... "action_options": [] ... } ... ] .. py:class:: TargetLeakageDataCheck(pct_corr_threshold=0.95, method='all') Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and other correlation metrics. 
If method='mutual_info', this data check uses mutual information and supports all target and feature types. Other correlation metrics only support binary with numeric and boolean dtypes. This method will return a value in [-1, 1] if other correlation metrics are selected and will return a value in [0, 1] if mutual information is selected. Correlation metrics available can be found in Woodwork's `dependence_dict method `_. :param pct_corr_threshold: The correlation threshold to be considered leakage. Defaults to 0.95. :type pct_corr_threshold: float :param method: The method to determine correlation. Use 'all' or 'max' for the maximum correlation, or for specific correlation metrics, use their name (e.g. 'mutual_info' for mutual information, 'pearson' for Pearson correlation, etc). Possible methods can be found in Woodwork's `config `_, under `correlation_metrics`. Defaults to 'all'. :type method: string **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.TargetLeakageDataCheck.name evalml.data_checks.TargetLeakageDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y) Check if any of the features are highly correlated with the target by using mutual information, Pearson correlation, and/or Spearman correlation. If `method='mutual_info'` or `method='max'`, supports all target and feature types. Other correlation metrics only support binary with numeric and boolean dtypes. This method will return a value in [-1, 1] if other correlation metrics are selected and will return a value in [0, 1] if mutual information is selected. :param X: The input features to check. :type X: pd.DataFrame, np.ndarray :param y: The target data. :type y: pd.Series, np.ndarray :returns: dict with a DataCheckWarning if target leakage is detected. :rtype: dict (DataCheckWarning) .. rubric:: Examples >>> import pandas as pd Any columns that are strongly correlated with the target will raise a warning.
This could be indicative of data leakage. >>> X = pd.DataFrame({ ... "leak": [10, 42, 31, 51, 61] * 15, ... "x": [42, 54, 12, 64, 12] * 15, ... "y": [13, 5, 13, 74, 24] * 15, ... }) >>> y = pd.Series([10, 42, 31, 51, 40] * 15) ... >>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95) >>> assert target_leakage_check.validate(X, y) == [ ... { ... "message": "Column 'leak' is 95.0% or more correlated with the target", ... "data_check_name": "TargetLeakageDataCheck", ... "level": "warning", ... "code": "TARGET_LEAKAGE", ... "details": {"columns": ["leak"], "rows": None}, ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "TargetLeakageDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["leak"], "rows": None} ... } ... ] ... } ... ] The method can be changed from the default to a specific metric such as pearson. >>> X["x"] = y / 2 >>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.8, method="pearson") >>> assert target_leakage_check.validate(X, y) == [ ... { ... "message": "Columns 'leak', 'x' are 80.0% or more correlated with the target", ... "data_check_name": "TargetLeakageDataCheck", ... "level": "warning", ... "details": {"columns": ["leak", "x"], "rows": None}, ... "code": "TARGET_LEAKAGE", ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "TargetLeakageDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["leak", "x"], "rows": None} ... } ... ] ... } ... ] .. py:class:: TimeSeriesParametersDataCheck(problem_configuration, n_splits) Checks whether the time series parameters are compatible with data splitting. If `gap + max_delay + forecast_horizon > X.shape[0] // (n_splits + 1)` then the feature engineering window is larger than the smallest split. This will cause the pipeline to create features from data that does not exist, which will cause errors. :param problem_configuration: Dict containing problem_configuration parameters.
:type problem_configuration: dict :param n_splits: Number of time series splits. :type n_splits: int **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.TimeSeriesParametersDataCheck.name evalml.data_checks.TimeSeriesParametersDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y=None) Check if the time series parameters are compatible with data splitting. :param X: Features. :type X: pd.DataFrame, np.ndarray :param y: Ignored. Defaults to None. :type y: pd.Series, np.ndarray :returns: dict with a DataCheckError if parameters are too big for the split sizes. :rtype: dict .. rubric:: Examples >>> import pandas as pd The time series parameters have to be compatible with the data passed. If the window size (gap + max_delay + forecast_horizon) is greater than or equal to the split size, then an error will be raised. >>> X = pd.DataFrame({ ... "dates": pd.date_range("1/1/21", periods=100), ... "first": [i for i in range(100)], ... }) >>> y = pd.Series([i for i in range(100)]) ... >>> problem_config = {"gap": 7, "max_delay": 2, "forecast_horizon": 12, "time_index": "dates"} >>> ts_parameters_check = TimeSeriesParametersDataCheck(problem_configuration=problem_config, n_splits=7) >>> assert ts_parameters_check.validate(X, y) == [ ... { ... "message": "Since the data has 100 observations, n_splits=7, and a forecast horizon of 12, the smallest " ... "split would have 16 observations. Since 21 (gap + max_delay + forecast_horizon)" ... " >= 16, then at least one of the splits would be empty by the time it reaches " ... "the pipeline. Please use a smaller number of splits, reduce one or more these " ... "parameters, or collect more data.", ... "data_check_name": "TimeSeriesParametersDataCheck", ... "level": "error", ... "code": "TIMESERIES_PARAMETERS_NOT_COMPATIBLE_WITH_SPLIT", ... "details": { ... "columns": None, ... "rows": None, ... "max_window_size": 21, ... "min_split_size": 16, ... 
"n_obs": 100, ... "n_splits": 7 ... }, ... "action_options": [] ... } ... ] .. py:class:: TimeSeriesSplittingDataCheck(problem_type, n_splits) Checks whether the time series target data is compatible with splitting. If the target data in the training and validation of every split doesn't have representation from all classes (for time series classification problems) this will prevent the estimators from training on all potential outcomes which will cause errors during prediction. :param problem_type: Problem type. :type problem_type: str or ProblemTypes :param n_splits: Number of time series splits. :type n_splits: int **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.TimeSeriesSplittingDataCheck.name evalml.data_checks.TimeSeriesSplittingDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: validate(self, X, y) Check if the training and validation targets are compatible with time series data splitting. :param X: Ignored. Features. :type X: pd.DataFrame, np.ndarray :param y: Target data. :type y: pd.Series, np.ndarray :returns: dict with a DataCheckError if splitting would result in inadequate class representation. :rtype: dict .. rubric:: Example >>> import pandas as pd Passing n_splits as 3 means that the data will be segmented into 4 parts to be iterated over for training and validation splits. The first split results in training indices of [0:25] and validation indices of [25:50]. The training indices of the first split result in only one unique value (0). The third split results in training indices of [0:75] and validation indices of [75:100]. The validation indices of the third split result in only one unique value (1). >>> X = None >>> y = pd.Series([0 if i < 45 else i % 2 if i < 55 else 1 for i in range(100)]) >>> ts_splitting_check = TimeSeriesSplittingDataCheck("time series binary", 3) >>> assert ts_splitting_check.validate(X, y) == [ ... { ... 
"message": "Time Series Binary and Time Series Multiclass problem " ... "types require every training and validation split to " ... "have at least one instance of all the target classes. " ... "The following splits are invalid: [1, 3]", ... "data_check_name": "TimeSeriesSplittingDataCheck", ... "level": "error", ... "details": { ... "columns": None, "rows": None, ... "invalid_splits": { ... 1: {"Training": [0, 25]}, ... 3: {"Validation": [75, 100]} ... } ... }, ... "code": "TIMESERIES_TARGET_NOT_COMPATIBLE_WITH_SPLIT", ... "action_options": [] ... } ... ] .. py:class:: UniquenessDataCheck(problem_type, threshold=0.5) Check if there are any columns in the input that are either too unique for classification problems or not unique enough for regression problems. :param problem_type: The specific problem type to data check for. e.g. 'binary', 'multiclass', 'regression, 'time series regression' :type problem_type: str or ProblemTypes :param threshold: The threshold to set as an upper bound on uniqueness for classification type problems or lower bound on for regression type problems. Defaults to 0.50. :type threshold: float **Methods** .. autoapisummary:: :nosignatures: evalml.data_checks.UniquenessDataCheck.name evalml.data_checks.UniquenessDataCheck.uniqueness_score evalml.data_checks.UniquenessDataCheck.validate .. py:method:: name(cls) Return a name describing the data check. .. py:method:: uniqueness_score(col, drop_na=True) :staticmethod: Calculate a uniqueness score for the provided field. NaN values are not considered as unique values in the calculation. Based on the Herfindahl-Hirschman Index. :param col: Feature values. :type col: pd.Series :param drop_na: Whether to drop null values when computing the uniqueness score. Defaults to True. :type drop_na: bool :returns: Uniqueness score. :rtype: (float) .. 
py:method:: validate(self, X, y=None) Check if there are any columns in the input that are too unique in the case of classification problems or not unique enough in the case of regression problems. :param X: Features. :type X: pd.DataFrame, np.ndarray :param y: Ignored. Defaults to None. :type y: pd.Series, np.ndarray :returns: dict with a DataCheckWarning if there are any too unique or not unique enough columns. :rtype: dict .. rubric:: Examples >>> import pandas as pd Because the problem type is regression, the column "regression_not_unique_enough" raises a warning for having just one value. >>> df = pd.DataFrame({ ... "regression_unique_enough": [float(x) for x in range(100)], ... "regression_not_unique_enough": [float(1) for x in range(100)] ... }) ... >>> uniqueness_check = UniquenessDataCheck(problem_type="regression", threshold=0.8) >>> assert uniqueness_check.validate(df) == [ ... { ... "message": "Input columns 'regression_not_unique_enough' for regression problem type are not unique enough.", ... "data_check_name": "UniquenessDataCheck", ... "level": "warning", ... "code": "NOT_UNIQUE_ENOUGH", ... "details": {"columns": ["regression_not_unique_enough"], "uniqueness_score": {"regression_not_unique_enough": 0.0}, "rows": None}, ... "action_options": [ ... { ... "code": "DROP_COL", ... "parameters": {}, ... "data_check_name": "UniquenessDataCheck", ... "metadata": {"columns": ["regression_not_unique_enough"], "rows": None} ... } ... ] ... } ... ] For multiclass, the column "regression_unique_enough" has too many unique values and will raise an appropriate warning. >>> y = pd.Series([1, 1, 1, 2, 2, 3, 3, 3]) >>> uniqueness_check = UniquenessDataCheck(problem_type="multiclass", threshold=0.8) >>> assert uniqueness_check.validate(df) == [ ... { ... "message": "Input columns 'regression_unique_enough' for multiclass problem type are too unique.", ... "data_check_name": "UniquenessDataCheck", ... "level": "warning", ... "details": { ... 
"columns": ["regression_unique_enough"], ... "rows": None, ... "uniqueness_score": {"regression_unique_enough": 0.99} ... }, ... "code": "TOO_UNIQUE", ... "action_options": [ ... { ... "code": "DROP_COL", ... "data_check_name": "UniquenessDataCheck", ... "parameters": {}, ... "metadata": {"columns": ["regression_unique_enough"], "rows": None} ... } ... ] ... } ... ] ... >>> assert UniquenessDataCheck.uniqueness_score(y) == 0.65625