id_columns_data_check#

Data check that checks if any of the features are likely to be ID columns.

Module Contents#

Classes Summary#

IDColumnsDataCheck

Check if any of the features are likely to be ID columns.

Contents#

class evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0, exclude_time_index=True)[source]#

Check if any of the features are likely to be ID columns.

Parameters
  • id_threshold (float) – The probability threshold above which a column is considered an ID column. Defaults to 1.0.

  • exclude_time_index (bool) – If True, the column set as the time index will not be included in the data check. Defaults to True.
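
For example, a check that also inspects the time index and uses a looser threshold could be constructed like this (a usage sketch; the parameter values are illustrative):

>>> from evalml.data_checks import IDColumnsDataCheck
>>> loose_check = IDColumnsDataCheck(id_threshold=0.9, exclude_time_index=False)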

Methods

name

Return a name describing the data check.

validate

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

name(cls)#

Return a name describing the data check.

validate(self, X, y=None)[source]#

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

Checks performed are (a minimal sketch follows the list):

  • column name is “id”

  • column name ends in “_id”

  • column contains all unique values (and is categorical / integer type)
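
A minimal sketch of these heuristics (illustrative only; the helper name and the scoring are assumptions inferred from the examples below, not the actual evalml implementation):

>>> import pandas as pd
>>> def id_column_probability(name, values):
...     """Illustrative scoring consistent with the examples below;
...     the actual evalml weights may differ."""
...     name_match = str(name).lower() == "id" or str(name).lower().endswith("_id")
...     all_unique = values.nunique() == len(values)
...     id_type = (pd.api.types.is_integer_dtype(values)
...                or isinstance(values.dtype, pd.CategoricalDtype))
...     if name_match and all_unique:
...         return 1.0  # name and uniqueness together make it a near-certain ID
...     if name_match or (all_unique and id_type):
...         return 0.95
...     return 0.0
...
>>> id_column_probability("customer_id", pd.Series([123, 124, 125]))
1.0
>>> id_column_probability("Country_Rank", pd.Series([1, 2, 3, 4, 5]))
0.95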

Parameters
  • X (pd.DataFrame, np.ndarray) – The input features to check.

  • y (pd.Series) – The target. Defaults to None. Ignored.

Returns

A list of warning dictionaries identifying, by column name or index, the features that are likely ID columns

Return type

list

Examples

>>> import pandas as pd
>>> from evalml.data_checks import IDColumnsDataCheck

Columns that end in “_id” and are completely unique are likely to be ID columns.

>>> df = pd.DataFrame({
...     "profits": [25, 15, 15, 31, 19],
...     "customer_id": [123, 124, 125, 126, 127],
...     "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["customer_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["customer_id"], "rows": None}
...             }
...         ]
...    }
... ]
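
To act on the warnings, the flagged column names can be read out of each warning’s details and dropped (a usage sketch; the variable names are illustrative):

>>> results = id_col_check.validate(df)
>>> flagged = [col for warning in results for col in warning["details"]["columns"]]
>>> df.drop(columns=flagged).columns.tolist()
['profits', 'Sales']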

Columns named “ID” with all unique values will also be identified as ID columns.

>>> df = df.rename(columns={"customer_id": "ID"})
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'ID' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["ID"], "rows": None},
...         "action_options": [
...            {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["ID"], "rows": None}
...             }
...         ]
...     }
... ]

Despite containing all unique values, “Country_Rank” will not be identified as an ID column, because id_threshold defaults to 1.0 and its name doesn’t indicate that it’s an ID.

>>> df = pd.DataFrame({
...    "humidity": ["high", "very high", "low", "low", "high"],
...    "Country_Rank": [1, 2, 3, 4, 5],
...    "Sales": ["very high", "high", "high", "medium", "very low"]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []

However, lowering the threshold causes this column to be identified as an ID column.

>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Country_Rank"], "rows": None},
...         "code": "HAS_ID_COLUMN",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["Country_Rank"], "rows": None}
...             }
...         ]
...     }
... ]

If the first column of the dataframe contains all unique values and is named “ID” or its name ends with “_id”, it is likely the primary key; the other ID columns should be dropped.

>>> df = pd.DataFrame({
...     "sales_id": [0, 1, 2, 3, 4],
...     "customer_id": [123, 124, 125, 126, 127],
...     "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "The first column 'sales_id' is likely to be the primary key",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_FIRST_COLUMN",
...         "details": {"columns": ["sales_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "SET_FIRST_COL_ID",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["sales_id"], "rows": None}
...             }
...         ]
...    },
...    {
...        "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["customer_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["customer_id"], "rows": None}
...             }
...         ]
...    }
... ]
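
Acting on both warnings is plain pandas: promote the probable primary key to the index and drop the remaining ID column (a sketch, not an evalml API):

>>> df = df.set_index("sales_id").drop(columns=["customer_id"])
>>> df.columns.tolist()
['Sales']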