evalml.data_checks.NaturalLanguageNaNDataCheck.validate

NaturalLanguageNaNDataCheck.validate(X, y=None)

Checks if any natural language columns contain NaN values.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Ignored. Defaults to None.

Returns

A dict with "warnings", "errors", and "actions" keys. The "errors" list contains a DataCheckError (converted to a dict) if NaN values are present in any natural language column.

Return type

dict

Example

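>>> import pandas as pd
>>> import woodwork as ww
>>> import numpy as np
>>> from evalml.data_checks import NaturalLanguageNaNDataCheck, DataCheckError, DataCheckMessageCode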
>>> data = pd.DataFrame()
>>> data['A'] = [None, "string_that_is_long_enough_for_natural_language"]
>>> data['B'] = ['string_that_is_long_enough_for_natural_language', 'string_that_is_long_enough_for_natural_language']
>>> data['C'] = np.random.randint(0, 3, size=len(data))
>>> data.ww.init(logical_types={'A': 'NaturalLanguage', 'B': 'NaturalLanguage'})
>>> nl_nan_check = NaturalLanguageNaNDataCheck()
>>> assert nl_nan_check.validate(data) == {
...        "warnings": [],
...        "actions": [],
...        "errors": [DataCheckError(message='Input natural language column(s) (A) contains NaN values. Please impute NaN values or drop these rows or columns.',
...                      data_check_name=NaturalLanguageNaNDataCheck.name,
...                      message_code=DataCheckMessageCode.NATURAL_LANGUAGE_HAS_NAN,
...                      details={"columns": 'A'}).to_dict()]
...    }
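
The condition this check reports can be approximated with plain pandas. The sketch below is illustrative only and is not EvalML's implementation: the helper name `find_nl_columns_with_nan` is hypothetical, and the natural language columns are passed in explicitly rather than read from Woodwork typing. It also shows the two remedies the error message suggests, dropping the affected rows or imputing.

```python
import pandas as pd


def find_nl_columns_with_nan(X, nl_columns):
    """Hypothetical helper: return the listed columns that contain NaN/None."""
    return [col for col in nl_columns if X[col].isna().any()]


data = pd.DataFrame({
    "A": [None, "string_that_is_long_enough_for_natural_language"],
    "B": ["string_that_is_long_enough_for_natural_language"] * 2,
})

# Assumption: columns A and B are the natural language columns.
cols_with_nan = find_nl_columns_with_nan(data, ["A", "B"])

# Remedy 1: drop rows that have NaN in the affected columns.
dropped = data.dropna(subset=cols_with_nan)

# Remedy 2: impute a placeholder (an empty string here) in the affected columns.
imputed = data.fillna({col: "" for col in cols_with_nan})
```

Whether to drop or impute depends on how much data the affected rows carry; an empty-string fill is only one possible placeholder.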