highly_null_data_check

Data check that flags highly-null columns and rows in the input.

Module Contents

Classes Summary

HighlyNullDataCheck

Check if there are any highly-null columns and rows in the input.

Contents

class evalml.data_checks.highly_null_data_check.HighlyNullDataCheck(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95)[source]

Check if there are any highly-null columns and rows in the input.

Parameters
  • pct_null_col_threshold (float) – If the fraction of null values in an input feature is greater than or equal to this threshold, that column will be considered highly-null. Expressed as a fraction between 0 and 1. Defaults to 0.95.

  • pct_null_row_threshold (float) – If the fraction of null values in an input row is greater than or equal to this threshold, that row will be considered highly-null. Expressed as a fraction between 0 and 1. Defaults to 0.95.
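
The threshold comparison can be approximated with plain pandas. The sketch below is not evalml's implementation; `null_fractions` and `flag_highly_null` are hypothetical helpers, and the inclusive `>=` comparison is an assumption based on the "50.0% or more null" wording in the examples on this page.

```python
import pandas as pd


def null_fractions(df):
    """Return (per-column, per-row) fractions of null values in df."""
    is_null = df.isna()
    return is_null.mean(axis=0), is_null.mean(axis=1)


def flag_highly_null(df, col_threshold=0.95, row_threshold=0.95):
    """Hypothetical helper: list columns and row indices at or above each threshold."""
    col_frac, row_frac = null_fractions(df)
    cols = col_frac[col_frac >= col_threshold].index.tolist()
    rows = row_frac[row_frac >= row_threshold].index.tolist()
    return cols, rows


# Same frame as the Examples section of this page.
df = pd.DataFrame({
    'all_null': [None, pd.NA, None, None, None],
    'lots_of_null': [None, None, None, None, 5],
    'few_null': ["near", "far", pd.NaT, "wherever", "nowhere"],
    'no_null': [1, 2, 3, 4, 5],
})

cols, rows = flag_highly_null(df, col_threshold=0.5, row_threshold=0.5)
print(cols)  # ['all_null', 'lots_of_null']
print(rows)  # [0, 1, 2, 3]
```

With both thresholds at 0.5, the flagged columns and rows match the ones reported in the Examples section, which is why the inclusive comparison seems a reasonable reading.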

Methods

name

Return a name describing the data check.

validate

Check if there are any highly-null columns or rows in the input.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if there are any highly-null columns or rows in the input.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with 'warnings', 'errors', and 'actions' keys. 'warnings' contains a DataCheckWarning for highly-null columns and another for highly-null rows, when any are found.

Return type

dict
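
As a rough illustration of reading the returned dict, the sketch below pulls flagged column names out of the warnings. The `results` literal is a hand-written stand-in that mirrors the shape shown in the examples on this page, not output captured from evalml, and `flagged_columns` is a hypothetical helper.

```python
# Hand-written stand-in for a validate() result (an assumption, not real output).
results = {
    "warnings": [
        {
            "message": "Columns 'all_null' are 95.0% or more null",
            "data_check_name": "HighlyNullDataCheck",
            "level": "warning",
            "details": {"columns": ["all_null"], "rows": None},
            "code": "HIGHLY_NULL_COLS",
        },
    ],
    "errors": [],
    "actions": [],
}


def flagged_columns(results):
    """Hypothetical helper: collect columns from every HIGHLY_NULL_COLS warning."""
    columns = []
    for warning in results["warnings"]:
        if warning["code"] == "HIGHLY_NULL_COLS":
            # 'columns' may be None for row-level warnings, so fall back to [].
            columns.extend(warning["details"]["columns"] or [])
    return columns


print(flagged_columns(results))  # ['all_null']
```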

Examples

>>> import pandas as pd
>>> from evalml.data_checks.highly_null_data_check import HighlyNullDataCheck
...
>>> class SeriesWrap():
...     def __init__(self, series):
...         self.series = series
...
...     def __eq__(self, series_2):
...         return all(self.series.eq(series_2.series))
...
>>> df = pd.DataFrame({
...     'all_null': [None, pd.NA, None, None, None],
...     'lots_of_null': [None, None, None, None, 5],
...     'few_null': ["near", "far", pd.NaT, "wherever", "nowhere"],
...     'no_null': [1, 2, 3, 4, 5]
... })
...
>>> highly_null_dc = HighlyNullDataCheck(pct_null_col_threshold=0.50)
>>> assert highly_null_dc.validate(df) == {
...     'warnings': [{'message': "Columns 'all_null', 'lots_of_null' are 50.0% or more null",
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['all_null', 'lots_of_null'],
...                               'rows': None,
...                               'pct_null_rows': {'all_null': 1.0, 'lots_of_null': 0.8},
...                               'null_row_indices': {'all_null': [0, 1, 2, 3, 4],
...                                                    'lots_of_null': [0, 1, 2, 3]}},
...                   'code': 'HIGHLY_NULL_COLS'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_COL',
...                  'metadata': {'columns': ['all_null', 'lots_of_null'], 'rows': None}}]}
...
...
>>> highly_null_dc = HighlyNullDataCheck(pct_null_row_threshold=0.50)
>>> validation_results = highly_null_dc.validate(df)
>>> validation_results['warnings'][0]['details']['pct_null_cols'] = SeriesWrap(validation_results['warnings'][0]['details']['pct_null_cols'])
>>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.75, 0.5]))
>>> assert validation_results == {
...     'warnings': [{'message': '4 out of 5 rows are 50.0% or more null',
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': None,
...                               'rows': [0, 1, 2, 3],
...                               'pct_null_cols': highly_null_rows},
...                   'code': 'HIGHLY_NULL_ROWS'},
...                  {'message': "Columns 'all_null' are 95.0% or more null",
...                   'data_check_name': 'HighlyNullDataCheck',
...                   'level': 'warning',
...                   'details': {'columns': ['all_null'],
...                               'rows': None,
...                               'pct_null_rows': {'all_null': 1.0},
...                               'null_row_indices': {'all_null': [0, 1, 2, 3, 4]}},
...                   'code': 'HIGHLY_NULL_COLS'}],
...     'errors': [],
...     'actions': [{'code': 'DROP_ROWS',
...                  'metadata': {'columns': None, 'rows': [0, 1, 2, 3]}},
...                 {'code': 'DROP_COL',
...                  'metadata': {'columns': ['all_null'], 'rows': None}}]}
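
The 'actions' list in the result above describes what to drop. One way to act on it with plain pandas is sketched below; `apply_drop_actions` is a hypothetical helper, not part of evalml, and it assumes the action codes and metadata layout are exactly as shown in the example output.

```python
import pandas as pd


def apply_drop_actions(df, actions):
    """Hypothetical helper: apply DROP_COL / DROP_ROWS actions to a copy of df."""
    out = df.copy()
    for action in actions:
        metadata = action["metadata"]
        if action["code"] == "DROP_COL":
            out = out.drop(columns=metadata["columns"])
        elif action["code"] == "DROP_ROWS":
            out = out.drop(index=metadata["rows"])
    return out


df = pd.DataFrame({
    'all_null': [None, pd.NA, None, None, None],
    'lots_of_null': [None, None, None, None, 5],
    'few_null': ["near", "far", pd.NaT, "wherever", "nowhere"],
    'no_null': [1, 2, 3, 4, 5],
})

# Actions copied from the second example's expected result.
actions = [
    {'code': 'DROP_ROWS', 'metadata': {'columns': None, 'rows': [0, 1, 2, 3]}},
    {'code': 'DROP_COL', 'metadata': {'columns': ['all_null'], 'rows': None}},
]

cleaned = apply_drop_actions(df, actions)
print(list(cleaned.columns))  # ['lots_of_null', 'few_null', 'no_null']
print(list(cleaned.index))    # [4]
```

Working on a copy leaves the original frame untouched, so the check can be re-run on it with different thresholds.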