null_data_check
============================================

.. py:module:: evalml.data_checks.null_data_check

.. autoapi-nested-parse::

   Data check that checks if there are any highly-null columns and rows in the input.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.null_data_check.NullDataCheck


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: NullDataCheck(pct_null_col_threshold=0.95, pct_moderately_null_col_threshold=0.2, pct_null_row_threshold=0.95)

   Check if there are any highly-null numerical, boolean, categorical, natural language, and unknown columns and rows in the input.

   :param pct_null_col_threshold: If the percentage of NaN values in an input feature exceeds this amount, that column will be considered highly-null. Defaults to 0.95.
   :type pct_null_col_threshold: float
   :param pct_moderately_null_col_threshold: If the percentage of NaN values in an input feature exceeds this amount but is less than the percentage specified in pct_null_col_threshold, that column will be considered moderately-null. Defaults to 0.20.
   :type pct_moderately_null_col_threshold: float
   :param pct_null_row_threshold: If the percentage of NaN values in an input row exceeds this amount, that row will be considered highly-null. Defaults to 0.95.
   :type pct_null_row_threshold: float

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.null_data_check.NullDataCheck.get_null_column_information
      evalml.data_checks.null_data_check.NullDataCheck.get_null_row_information
      evalml.data_checks.null_data_check.NullDataCheck.name
      evalml.data_checks.null_data_check.NullDataCheck.validate

   .. py:method:: get_null_column_information(X, pct_null_col_threshold=0.0)
      :staticmethod:

      Finds columns that are considered highly null (percentage null is greater than threshold) and returns dictionary mapping column name to percentage null and dictionary mapping column name to null indices.

      :param X: DataFrame to check for highly null columns.
      :type X: pd.DataFrame
      :param pct_null_col_threshold: Percentage threshold for a column to be considered null. Defaults to 0.0.
      :type pct_null_col_threshold: float

      :returns: Tuple containing: dictionary mapping column name to its null percentage and dictionary mapping column name to null indices in that column.
      :rtype: tuple

   .. py:method:: get_null_row_information(X, pct_null_row_threshold=0.0)
      :staticmethod:

      Finds rows that are considered highly null (percentage null is greater than threshold).

      :param X: DataFrame to check for highly null rows.
      :type X: pd.DataFrame
      :param pct_null_row_threshold: Percentage threshold for a row to be considered null. Defaults to 0.0.
      :type pct_null_row_threshold: float

      :returns: Series containing the percentage null for each row.
      :rtype: pd.Series
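
      A minimal usage sketch for this helper and ``get_null_column_information`` above (illustrative, not part of the generated reference). The ``frame`` DataFrame and variable names are hypothetical, and a plain ``pd.DataFrame`` is assumed to be acceptable input, as the signatures suggest; thresholds are treated as inclusive, matching the validate examples below.

      >>> import pandas as pd
      >>> from evalml.data_checks import NullDataCheck
      >>> frame = pd.DataFrame({
      ...     "mostly_null": [None, None, None, 4],
      ...     "no_null": [1, 2, 3, 4]
      ... })
      >>> # columns at or above the 50% null threshold, with their null fractions and null indices
      >>> pct_null_cols, null_indices = NullDataCheck.get_null_column_information(frame, pct_null_col_threshold=0.5)
      >>> sorted(pct_null_cols)
      ['mostly_null']
      >>> # rows at or above the 50% null threshold, as a Series of null fractions indexed by row
      >>> pct_null_rows = NullDataCheck.get_null_row_information(frame, pct_null_row_threshold=0.5)
      >>> sorted(pct_null_rows.index)
      [0, 1, 2]
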
   .. py:method:: name(cls)

      Return a name describing the data check.

   .. py:method:: validate(self, X, y=None)

      Check if there are any highly-null columns or rows in the input.

      :param X: Features.
      :type X: pd.DataFrame, np.ndarray
      :param y: Ignored. Defaults to None.
      :type y: pd.Series, np.ndarray

      :returns: list with DataCheckWarning messages if there are any highly-null columns or rows.
      :rtype: list[dict]

      .. rubric:: Examples

      >>> import pandas as pd
      ...
      >>> class SeriesWrap():
      ...     def __init__(self, series):
      ...         self.series = series
      ...
      ...     def __eq__(self, series_2):
      ...         return all(self.series.eq(series_2.series))

      With pct_null_col_threshold set to 0.50, any column that has 50% or more of its observations set to null will be included in the warning, as well as the percentage of null values identified ("all_null": 1.0, "lots_of_null": 0.8).

      >>> df = pd.DataFrame({
      ...     "all_null": [None, pd.NA, None, None, None],
      ...     "lots_of_null": [None, None, None, None, 5],
      ...     "few_null": [1, 2, None, 2, 3],
      ...     "no_null": [1, 2, 3, 4, 5]
      ... })
      ...
      >>> highly_null_dc = NullDataCheck(pct_null_col_threshold=0.50)
      >>> assert highly_null_dc.validate(df) == [
      ...     {
      ...         "message": "Column(s) 'all_null', 'lots_of_null' are 50.0% or more null",
      ...         "data_check_name": "NullDataCheck",
      ...         "level": "warning",
      ...         "details": {
      ...             "columns": ["all_null", "lots_of_null"],
      ...             "rows": None,
      ...             "pct_null_rows": {"all_null": 1.0, "lots_of_null": 0.8}
      ...         },
      ...         "code": "HIGHLY_NULL_COLS",
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "NullDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": ["all_null", "lots_of_null"], "rows": None}
      ...             }
      ...         ]
      ...     },
      ...     {
      ...         "message": "Column(s) 'few_null' have between 20.0% and 50.0% null values",
      ...         "data_check_name": "NullDataCheck",
      ...         "level": "warning",
      ...         "details": {"columns": ["few_null"], "rows": None},
      ...         "code": "COLS_WITH_NULL",
      ...         "action_options": [
      ...             {
      ...                 "code": "IMPUTE_COL",
      ...                 "data_check_name": "NullDataCheck",
      ...                 "metadata": {"columns": ["few_null"], "rows": None, "is_target": False},
      ...                 "parameters": {
      ...                     "impute_strategies": {
      ...                         "parameter_type": "column",
      ...                         "columns": {
      ...                             "few_null": {
      ...                                 "impute_strategy": {"categories": ["mean", "most_frequent"], "type": "category", "default_value": "mean"}
      ...                             }
      ...                         }
      ...                     }
      ...                 }
      ...             }
      ...         ]
      ...     }
      ... ]

      With pct_null_row_threshold set to 0.50, any row with 50% or more of its respective column values set to null will be included in the warning, as well as the offending rows ("rows": [0, 1, 2, 3]). Since the default value for pct_null_col_threshold is 0.95, "all_null" is also included in the warnings since the percentage of null values in that column is over 95%. Since the default value for pct_moderately_null_col_threshold is 0.20, "few_null" is included as a "moderately null" column as it has a null column percentage of 20%.

      >>> highly_null_dc = NullDataCheck(pct_null_row_threshold=0.50)
      >>> validation_messages = highly_null_dc.validate(df)
      >>> validation_messages[0]["details"]["pct_null_cols"] = SeriesWrap(validation_messages[0]["details"]["pct_null_cols"])
      >>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.75, 0.5]))
      >>> assert validation_messages == [
      ...     {
      ...         "message": "4 out of 5 rows are 50.0% or more null",
      ...         "data_check_name": "NullDataCheck",
      ...         "level": "warning",
      ...         "details": {
      ...             "columns": None,
      ...             "rows": [0, 1, 2, 3],
      ...             "pct_null_cols": highly_null_rows
      ...         },
      ...         "code": "HIGHLY_NULL_ROWS",
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_ROWS",
      ...                 "data_check_name": "NullDataCheck",
      ...                 "parameters": {},
      ...                 "metadata": {"columns": None, "rows": [0, 1, 2, 3]}
      ...             }
      ...         ]
      ...     },
      ...     {
      ...         "message": "Column(s) 'all_null' are 95.0% or more null",
      ...         "data_check_name": "NullDataCheck",
      ...         "level": "warning",
      ...         "details": {
      ...             "columns": ["all_null"],
      ...             "rows": None,
      ...             "pct_null_rows": {"all_null": 1.0}
      ...         },
      ...         "code": "HIGHLY_NULL_COLS",
      ...         "action_options": [
      ...             {
      ...                 "code": "DROP_COL",
      ...                 "data_check_name": "NullDataCheck",
      ...                 "metadata": {"columns": ["all_null"], "rows": None},
"parameters": {} ... } ... ] ... }, ... { ... "message": "Column(s) 'lots_of_null', 'few_null' have between 20.0% and 95.0% null values", ... "data_check_name": "NullDataCheck", ... "level": "warning", ... "details": {"columns": ["lots_of_null", "few_null"], "rows": None}, ... "code": "COLS_WITH_NULL", ... "action_options": [ ... { ... "code": "IMPUTE_COL", ... "data_check_name": "NullDataCheck", ... "metadata": {"columns": ["lots_of_null", "few_null"], "rows": None, "is_target": False}, ... "parameters": { ... "impute_strategies": { ... "parameter_type": "column", ... "columns": { ... "lots_of_null": {"impute_strategy": {"categories": ["mean", "most_frequent"], "type": "category", "default_value": "mean"}}, ... "few_null": {"impute_strategy": {"categories": ["mean", "most_frequent"], "type": "category", "default_value": "mean"}} ... } ... } ... } ... } ... ] ... } ... ]