id_columns_data_check

Data check that checks if any of the features are likely to be ID columns.

Module Contents

Classes Summary

IDColumnsDataCheck

Check if any of the features are likely to be ID columns.

Contents

class evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0)

Check if any of the features are likely to be ID columns.

Parameters: id_threshold (float) – The minimum probability at which a column is reported as an ID column. Defaults to 1.0.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

name(cls): Return a name describing the data check.

validate(self, X, y=None)

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

Checks performed are:

column name is “id” (matched case-insensitively, so “ID” also qualifies)

column name ends in “_id”

column contains all unique values (and is categorical / integer type)
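The three checks and the threshold comparison can be sketched in plain Python. This is an illustration, not evalml's implementation: `id_likelihoods` and `flag_id_columns` are hypothetical helpers, columns are modeled as a dict of lists rather than a DataFrame, and the scores (0.95 for a single signal, 1.0 for a name match combined with uniqueness) are assumptions chosen to be consistent with the examples on this page.

```python
def id_likelihoods(columns):
    """Score each column; `columns` maps a column name to a list of values.

    Assumed scoring (not taken from evalml): 0.95 when only one signal
    fires, 1.0 when the name looks like an ID and the values are all unique.
    """
    scores = {}
    for name, values in columns.items():
        lowered = str(name).lower()
        name_signal = lowered == "id" or lowered.endswith("_id")

        # Stand-in for "categorical / integer type": all ints (excluding
        # bools) or all strings.
        all_ints = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        all_strs = all(isinstance(v, str) for v in values)
        id_like_type = len(values) > 0 and (all_ints or all_strs)
        unique_signal = id_like_type and len(set(values)) == len(values)

        if name_signal and unique_signal:
            scores[name] = 1.0
        elif name_signal or unique_signal:
            scores[name] = 0.95
    return scores


def flag_id_columns(columns, id_threshold=1.0):
    """Return the names of columns whose score meets the threshold."""
    return [
        name
        for name, score in id_likelihoods(columns).items()
        if score >= id_threshold
    ]


ranked = {
    "Country_Rank": [1, 2, 3, 4, 5],
    "Sales": ["very high", "high", "high", "medium", "very low"],
}
print(flag_id_columns(ranked))                     # default threshold of 1.0
print(flag_id_columns(ranked, id_threshold=0.95))  # lowered threshold
```

Under these assumed scores, a column that is merely unique never reaches the default threshold of 1.0, which is why lowering id_threshold changes what gets reported.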

Parameters

X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series) – The target. Defaults to None. Ignored.

Returns

A list of warning dictionaries describing the columns whose probability of being an ID column meets id_threshold. The list is empty when no column qualifies.

Return type

list[dict]

Examples

>>> import pandas as pd
>>> from evalml.data_checks import IDColumnsDataCheck

Columns that end in “_id” and are completely unique are likely to be ID columns.

>>> df = pd.DataFrame({
...     "customer_id": [123, 124, 125, 126, 127],
...     "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["customer_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["customer_id"], "rows": None}
...             }
...         ]
...     }
... ]

Columns named “ID” with all unique values will also be identified as ID columns.

>>> df = df.rename(columns={"customer_id": "ID"})
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'ID' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["ID"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["ID"], "rows": None}
...             }
...         ]
...     }
... ]

Despite being all unique, “Country_Rank” will not be identified as an ID column as id_threshold is set to 1.0 by default and its name doesn’t indicate that it’s an ID.

>>> df = pd.DataFrame({
...     "Country_Rank": [1, 2, 3, 4, 5],
...     "Sales": ["very high", "high", "high", "medium", "very low"]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []

However, lowering the threshold will cause this column to be identified as an ID.

>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["Country_Rank"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["Country_Rank"], "rows": None}
...             }
...         ]
...     }
... ]
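A common follow-up is to act on the suggested DROP_COL actions. The sketch below walks the result structure shown in the examples above and collects the column names to drop. `columns_to_drop` is a hypothetical helper written for this illustration, not part of evalml, and it assumes the result keeps the "action_options", "code" and "metadata" keys seen in those examples.

```python
def columns_to_drop(results):
    """Collect column names from every DROP_COL action in a validate() result."""
    to_drop = []
    for warning in results:
        for action in warning.get("action_options", []):
            if action.get("code") != "DROP_COL":
                continue
            # "columns" may be None for row-level actions, hence the `or []`.
            for col in action.get("metadata", {}).get("columns") or []:
                if col not in to_drop:
                    to_drop.append(col)
    return to_drop


# A result copied from the last example on this page.
results = [
    {
        "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
        "data_check_name": "IDColumnsDataCheck",
        "level": "warning",
        "code": "HAS_ID_COLUMN",
        "details": {"columns": ["Country_Rank"], "rows": None},
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "IDColumnsDataCheck",
                "parameters": {},
                "metadata": {"columns": ["Country_Rank"], "rows": None},
            }
        ],
    }
]

print(columns_to_drop(results))
```

With a pandas DataFrame, the returned names could then be passed to `df.drop(columns=...)` before training.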
