highly_null_data_check

Data check that detects highly-null columns and rows in the input.

Module Contents

Classes Summary

HighlyNullDataCheck

Check if there are any highly-null columns and rows in the input.

Contents

class evalml.data_checks.highly_null_data_check.HighlyNullDataCheck(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95)

Check if there are any highly-null columns and rows in the input.

Parameters
  • pct_null_col_threshold (float) – Fraction between 0 and 1. If the fraction of NaN values in an input feature is at or above this threshold, that column will be considered highly-null. Defaults to 0.95.

  • pct_null_row_threshold (float) – Fraction between 0 and 1. If the fraction of NaN values in an input row is at or above this threshold, that row will be considered highly-null. Defaults to 0.95.
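
As a rough illustration of what the two thresholds measure, the per-column and per-row null fractions can be computed directly with pandas. This is a sketch under assumptions, not evalml's implementation: the variable names are hypothetical, and the `>=` comparison is inferred from the example at the end of this page, where rows that are exactly 50% null are flagged at a threshold of 0.50.

```python
import pandas as pd

X = pd.DataFrame({
    "lots_of_null": [None, None, None, None, 5],
    "no_null": [1, 2, 3, 4, 5],
})

# Hypothetical threshold values chosen for this illustration.
col_threshold = 0.5
row_threshold = 0.5

# Fraction of null values per column (mean of the boolean null mask down each column).
col_null_fraction = X.isnull().mean()
# Fraction of null values per row (mean of the boolean null mask across each row).
row_null_fraction = X.isnull().mean(axis=1)

# Assumed comparison: a fraction at or above the threshold counts as highly null.
highly_null_cols = col_null_fraction[col_null_fraction >= col_threshold].index.tolist()
highly_null_rows = row_null_fraction[row_null_fraction >= row_threshold].index.tolist()

print(highly_null_cols)
print(highly_null_rows)
```

With this input, one column is 80% null and four of the five rows are 50% null, so both sets are non-empty at a threshold of 0.5, while neither would be flagged at the default of 0.95.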

Methods

name

Return a name describing the data check.

validate

Check if there are any highly-null columns or rows in the input.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)

Check if there are any highly-null columns or rows in the input.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

dict with "errors", "warnings", and "actions" keys. "warnings" contains a DataCheckWarning for highly-null rows and one for highly-null columns, if any are found.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks import HighlyNullDataCheck
>>> # Wrapper that lets a pandas Series inside the results dict be compared with ==.
>>> class SeriesWrap():
...     def __init__(self, series):
...         self.series = series
...
...     def __eq__(self, series_2):
...         return all(self.series.eq(series_2.series))
...
>>> df = pd.DataFrame({
...     'lots_of_null': [None, None, None, None, 5],
...     'no_null': [1, 2, 3, 4, 5]
... })
>>> null_check = HighlyNullDataCheck(pct_null_col_threshold=0.50, pct_null_row_threshold=0.50)
>>> validation_results = null_check.validate(df)
>>> validation_results['warnings'][0]['details']['pct_null_cols'] = SeriesWrap(validation_results['warnings'][0]['details']['pct_null_cols'])
>>> highly_null_rows = SeriesWrap(pd.Series([0.5, 0.5, 0.5, 0.5]))
>>> assert validation_results == {
...     "errors": [],
...     "warnings": [{"message": "4 out of 5 rows are more than 50.0% null",
...                   "data_check_name": "HighlyNullDataCheck",
...                   "level": "warning",
...                   "code": "HIGHLY_NULL_ROWS",
...                   "details": {"pct_null_cols": highly_null_rows, "columns": None, "rows": [0, 1, 2, 3]}},
...                  {"message": "Columns 'lots_of_null' are 50.0% or more null",
...                   "data_check_name": "HighlyNullDataCheck",
...                   "level": "warning",
...                   "code": "HIGHLY_NULL_COLS",
...                   "details": {"columns": ["lots_of_null"], "pct_null_rows": {"lots_of_null": 0.8}, "null_row_indices": {"lots_of_null": [0, 1, 2, 3]}, "rows": None}}],
...     "actions": [{"code": "DROP_ROWS", "metadata": {"rows": [0, 1, 2, 3], "columns": None}},
...                 {"code": "DROP_COL", "metadata": {"columns": ["lots_of_null"], "rows": None}}]}
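
One way to act on the result is to walk the "actions" list and drop the flagged rows and columns with pandas. The sketch below hardcodes an actions list copied from the example above rather than calling evalml, and `apply_actions` is a hypothetical helper, not part of the library.

```python
import pandas as pd


def apply_actions(X, actions):
    """Return a copy of X with the rows and columns named in `actions` dropped."""
    cleaned = X.copy()
    for action in actions:
        metadata = action["metadata"]
        if action["code"] == "DROP_ROWS" and metadata["rows"]:
            # errors="ignore" tolerates labels that are no longer present.
            cleaned = cleaned.drop(index=metadata["rows"], errors="ignore")
        elif action["code"] == "DROP_COL" and metadata["columns"]:
            cleaned = cleaned.drop(columns=metadata["columns"], errors="ignore")
    return cleaned


df = pd.DataFrame({
    "lots_of_null": [None, None, None, None, 5],
    "no_null": [1, 2, 3, 4, 5],
})

# Actions copied from the expected validation results shown above.
actions = [
    {"code": "DROP_ROWS", "metadata": {"rows": [0, 1, 2, 3], "columns": None}},
    {"code": "DROP_COL", "metadata": {"columns": ["lots_of_null"], "rows": None}},
]

cleaned = apply_actions(df, actions)
print(cleaned)
```

Applying both actions here leaves a single row and a single column, because the flagged rows and the flagged column overlap. In practice it may be preferable to apply only one of the two actions and then re-run the check.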