id_columns_data_check

Data check that checks if any of the features are likely to be ID columns.

Module Contents

Classes Summary

IDColumnsDataCheck

Check if any of the features are likely to be ID columns.

Contents

class evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0)

Check if any of the features are likely to be ID columns.

Parameters

id_threshold (float) – The minimum probability at which a column is flagged as an ID column. Defaults to 1.0.
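
As a rough illustration of how a threshold such as id_threshold could act on per-column scores, the sketch below keeps only the columns whose score is at or above the threshold. The helper name flag_id_columns and the scores are hypothetical and not part of evalml. The at-or-above comparison is an assumption based on the "100.0% or more likely" wording of the warning message in the example further down.

```python
def flag_id_columns(scores, id_threshold=1.0):
    """Hypothetical helper, not evalml API: keep columns scoring at or above the threshold."""
    return {col: p for col, p in scores.items() if p >= id_threshold}


# Made-up scores used purely for illustration.
scores = {"df_id": 1.0, "x": 0.95, "y": 0.2}

strict = flag_id_columns(scores)                     # default threshold of 1.0
lenient = flag_id_columns(scores, id_threshold=0.9)  # lower threshold flags more columns
```

Under these assumptions, lowering the threshold trades fewer missed ID columns for more false positives.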

Methods

name

Return a name describing the data check.

validate

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

Checks performed are:

  • column name is “id”

  • column name ends in “_id”

  • column contains all unique values (and is categorical / integer type)
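
The three checks above can be sketched in plain pandas. This is a minimal illustration, not evalml's implementation: the helper name matched_id_heuristics is hypothetical, the case-insensitive name matching is an assumption, and "categorical / integer type" is approximated with pandas dtypes.

```python
import pandas as pd
from pandas.api import types as ptypes


def matched_id_heuristics(X: pd.DataFrame) -> dict:
    """Hypothetical helper: map each column to the ID heuristics it matches."""
    matches = {}
    for col in X.columns:
        name = str(col).lower()  # assumption: names are compared case-insensitively
        series = X[col]
        hits = []
        if name == "id":
            hits.append("named_id")
        if name.endswith("_id"):
            hits.append("ends_in_id")
        # Approximation of "categorical / integer type" using pandas dtypes.
        is_candidate_type = ptypes.is_integer_dtype(series) or isinstance(
            series.dtype, pd.CategoricalDtype
        )
        if is_candidate_type and len(series) > 0 and series.nunique() == len(series):
            hits.append("all_unique")
        if hits:
            matches[col] = hits
    return matches


df = pd.DataFrame({
    "df_id": [0, 1, 2, 3, 4],
    "x": [10, 42, 31, 51, 61],
    "y": [42, 54, 12, 64, 12],
})
result = matched_id_heuristics(df)
```

Here 'df_id' matches two heuristics, 'x' matches only the uniqueness check, and 'y' matches none. The documented example flags only 'df_id' at the default threshold, which suggests that a single match is not enough to reach a probability of 1.0.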

Parameters

  • X (pd.DataFrame, np.ndarray) – The input features to check.

  • y (pd.Series) – The target. Defaults to None. Ignored.

Returns

A dictionary with "errors", "warnings", and "actions" keys. Each warning names a column that is likely to be an ID column, and each action suggests dropping that column.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks.id_columns_data_check import IDColumnsDataCheck
>>> df = pd.DataFrame({
...     'df_id': [0, 1, 2, 3, 4],
...     'x': [10, 42, 31, 51, 61],
...     'y': [42, 54, 12, 64, 12]
... })
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Column 'df_id' is 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"column": "df_id"}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"column": "df_id"}}]}
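
One way to act on the returned dictionary is to drop every column named in a "DROP_COL" action. The sketch below does not call evalml. Instead it copies the result from the example above into a literal so that it runs on its own. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "df_id": [0, 1, 2, 3, 4],
    "x": [10, 42, 31, 51, 61],
    "y": [42, 54, 12, 64, 12],
})

# Copied from the documented example output, not produced by calling evalml here.
validation_result = {
    "errors": [],
    "warnings": [{"message": "Column 'df_id' is 100.0% or more likely to be an ID column",
                  "data_check_name": "IDColumnsDataCheck",
                  "level": "warning",
                  "code": "HAS_ID_COLUMN",
                  "details": {"column": "df_id"}}],
    "actions": [{"code": "DROP_COL",
                 "metadata": {"column": "df_id"}}],
}

# Collect the columns that the suggested actions ask to drop.
cols_to_drop = [
    action["metadata"]["column"]
    for action in validation_result["actions"]
    if action["code"] == "DROP_COL"
]

# drop() returns a new DataFrame, so the original df is left unchanged.
cleaned = df.drop(columns=cols_to_drop)
```

Filtering on the action code keeps the loop safe if the dictionary ever carries other kinds of actions.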