highly_null_data_check
Data check that flags highly-null columns and rows in the input.
Module Contents
Classes Summary
HighlyNullDataCheck | Check if there are any highly-null columns and rows in the input.
Contents
class evalml.data_checks.highly_null_data_check.HighlyNullDataCheck(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95)
Check if there are any highly-null columns and rows in the input.
- Parameters
pct_null_col_threshold (float) – If the percentage of null values in an input feature, expressed as a fraction between 0 and 1, is at or above this amount, that column will be considered highly-null. Defaults to 0.95.
pct_null_row_threshold (float) – If the percentage of null values in an input row, expressed as a fraction between 0 and 1, is at or above this amount, that row will be considered highly-null. Defaults to 0.95.
Methods
get_null_column_information | Finds columns that are considered highly null (percentage null is at or above the threshold) and returns two dictionaries: one mapping column name to percentage null, and one mapping column name to null indices.
get_null_row_information | Finds rows that are considered highly null (percentage null is at or above the threshold).
name | Return a name describing the data check.
validate | Check if there are any highly-null columns or rows in the input.
static get_null_column_information(X, pct_null_col_threshold=0.0)
Finds columns that are considered highly null (percentage null is at or above the threshold) and returns two dictionaries: one mapping column name to percentage null, and one mapping column name to null indices.
- Parameters
X (pd.DataFrame) – DataFrame to check for highly null columns.
pct_null_col_threshold (float) – Percentage threshold for a column to be considered highly null. Defaults to 0.0.
- Returns
Tuple containing: dictionary mapping column name to its null percentage and dictionary mapping column name to null indices in that column.
- Return type
tuple
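The return shape described above can be illustrated with plain pandas. This is a minimal sketch, not the evalml implementation: the function name `null_column_information` and the sample DataFrame are hypothetical, and the `>=` comparison is an assumption taken from the "or more null" wording of the warning messages in the Examples.

```python
import pandas as pd


def null_column_information(X, pct_null_col_threshold=0.0):
    """Hypothetical stand-in for illustration; not the evalml code."""
    # Fraction of null entries per column, between 0.0 and 1.0.
    pct_null = X.isnull().mean()
    # Assumption: a column is reported when its fraction reaches the threshold.
    highly_null = {
        col: float(pct)
        for col, pct in pct_null.items()
        if pct >= pct_null_col_threshold
    }
    # Index labels of the null entries in each reported column.
    null_indices = {
        col: X.index[X[col].isnull().to_numpy()].tolist()
        for col in highly_null
    }
    return highly_null, null_indices


df = pd.DataFrame({
    "all_null": [None, None, None, None, None],
    "lots_of_null": [None, None, None, None, 5],
    "no_null": [1, 2, 3, 4, 5],
})
pct_by_col, idx_by_col = null_column_information(df, pct_null_col_threshold=0.5)
print(pct_by_col)
print(idx_by_col)
```

The first dictionary carries the null fraction per reported column, and the second lists where the nulls sit, which is the pairing the Returns entry describes.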
static get_null_row_information(X, pct_null_row_threshold=0.0)
Finds rows that are considered highly null (percentage null is at or above the threshold).
- Parameters
X (pd.DataFrame) – DataFrame to check for highly null rows.
pct_null_row_threshold (float) – Percentage threshold for a row to be considered highly null. Defaults to 0.0.
- Returns
Series containing the percentage null for each row that meets the threshold.
- Return type
pd.Series
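A row-wise version of the same idea can be sketched with pandas alone. The function name `null_row_information` and the sample data are hypothetical, and filtering with `>=` is an assumption based on the "or more null" wording in the Examples, not a statement of how evalml computes it.

```python
import pandas as pd


def null_row_information(X, pct_null_row_threshold=0.0):
    """Hypothetical stand-in for illustration; not the evalml code."""
    # Fraction of null entries in each row, between 0.0 and 1.0.
    pct_null_rows = X.isnull().mean(axis=1)
    # Assumption: keep only the rows whose fraction reaches the threshold.
    return pct_null_rows[pct_null_rows >= pct_null_row_threshold]


df = pd.DataFrame({
    "a": [None, None, None, 1.0],
    "b": [None, 2.0, None, 2.0],
    "c": [None, 3.0, 3.0, 3.0],
    "d": [4.0, 4.0, 4.0, 4.0],
})
highly_null_rows = null_row_information(df, pct_null_row_threshold=0.5)
print(highly_null_rows)
```

The result keeps the original index labels, so it can be used directly to locate or drop the affected rows.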
name(cls)
Return a name describing the data check.
validate(self, X, y=None)
Check if there are any highly-null columns or rows in the input.
- Parameters
X (pd.DataFrame, np.ndarray) – Features.
y (pd.Series, np.ndarray) – Ignored. Defaults to None.
- Returns
dict with a DataCheckWarning if there are any highly-null columns or rows.
- Return type
dict
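One way to consume the returned dictionary is to read the suggested actions out of it. The sketch below uses a hand-written result that mirrors the layout shown in this page's Examples, so it runs without evalml; the helper `columns_to_drop` is hypothetical and not part of the library.

```python
# Hand-written result mirroring the layout in the Examples section;
# it is not produced by calling evalml here.
result = {
    "warnings": [
        {
            "message": "Columns 'all_null', 'lots_of_null' are 50.0% or more null",
            "data_check_name": "HighlyNullDataCheck",
            "level": "warning",
            "details": {
                "columns": ["all_null", "lots_of_null"],
                "rows": None,
                "pct_null_rows": {"all_null": 1.0, "lots_of_null": 0.8},
            },
            "code": "HIGHLY_NULL_COLS",
        }
    ],
    "errors": [],
    "actions": [
        {
            "code": "DROP_COL",
            "data_check_name": "HighlyNullDataCheck",
            "metadata": {"columns": ["all_null", "lots_of_null"], "rows": None},
        }
    ],
}


def columns_to_drop(validation_result):
    """Hypothetical helper: collect column names from DROP_COL actions."""
    cols = []
    for action in validation_result["actions"]:
        if action["code"] == "DROP_COL":
            # "columns" may be None for row-level actions, so guard against it.
            cols.extend(action["metadata"]["columns"] or [])
    return cols


to_drop = columns_to_drop(result)
print(to_drop)
```

Keying on the action `code` rather than parsing the warning message keeps the caller independent of the message wording.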
Examples
>>> import pandas as pd
>>> from evalml.data_checks.highly_null_data_check import HighlyNullDataCheck
...
>>> # Wrapper so that a pd.Series inside the result dict can be compared with ==.
>>> class SeriesWrap():
...     def __init__(self, series):
...         self.series = series
...
...     def __eq__(self, series_2):
...         return all(self.series.eq(series_2.series))
...
>>> df = pd.DataFrame({
...     'all_null': [None, pd.NA, None, None, None],
...     'lots_of_null': [None, None, None, None, 5],
...     'few_null': ["near", "far", pd.NaT, "wherever", "nowhere"],
...     'no_null': [1, 2, 3, 4, 5]
... })
...
>>> # Lower only the column threshold; the row threshold stays at its default.
>>> highly_null_dc = HighlyNullDataCheck(pct_null_col_threshold=0.50)
>>> assert highly_null_dc.validate(df) == {
...     'warnings': [{'message': "Columns 'all_null', 'lots_of_null' are 50.0% or more null",
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['all_null', 'lots_of_null'],
...                               'rows': None,
...                               'pct_null_rows': {'all_null': 1.0, 'lots_of_null': 0.8}},
...                   'code': 'HIGHLY_NULL_COLS'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'data_check_name': 'HighlyNullDataCheck',
...                  'metadata': {'columns': ['all_null', 'lots_of_null'], 'rows': None}}]}
...
...
>>> # Lower only the row threshold; the column threshold stays at 0.95.
>>> highly_null_dc = HighlyNullDataCheck(pct_null_row_threshold=0.50)
>>> validation_results = highly_null_dc.validate(df)
>>> validation_results['warnings'][0]['details']['pct_null_cols'] = SeriesWrap(validation_results['warnings'][0]['details']['pct_null_cols'])
>>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.75, 0.5]))
>>> assert validation_results == {
...     'warnings': [{'message': '4 out of 5 rows are 50.0% or more null',
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': None,
...                               'rows': [0, 1, 2, 3],
...                               'pct_null_cols': highly_null_rows},
...                   'code': 'HIGHLY_NULL_ROWS'},
...                  {'message': "Columns 'all_null' are 95.0% or more null",
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['all_null'],
...                               'rows': None,
...                               'pct_null_rows': {'all_null': 1.0}},
...                   'code': 'HIGHLY_NULL_COLS'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_ROWS',
...                  'data_check_name': 'HighlyNullDataCheck',
...                  'metadata': {'columns': None, 'rows': [0, 1, 2, 3]}},
...                 {'code': 'DROP_COL',
...                  'data_check_name': 'HighlyNullDataCheck',
...                  'metadata': {'columns': ['all_null'], 'rows': None}}]}