id_columns_data_check
Data check that checks if any of the features are likely to be ID columns.
Module Contents
Classes Summary
IDColumnsDataCheck | Check if any of the features are likely to be ID columns.
Contents
class evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0)
Check if any of the features are likely to be ID columns.
- Parameters
id_threshold (float) – The probability threshold to be considered an ID column. Defaults to 1.0.
Methods
name | Return a name describing the data check.
validate | Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.
name(cls)
Return a name describing the data check.
validate(self, X, y=None)
Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.
Checks performed are:
column name is “id”
column name ends in “_id”
column contains all unique values (and is categorical / integer type)
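The three checks above can be sketched in plain pandas. This is a minimal illustration, not the evalml implementation: the helper name `id_column_scores` is hypothetical, and the scoring rule (0.95 when only one kind of evidence is present, 1.0 when both the name and the uniqueness checks fire) is an assumption chosen to be consistent with the examples in this section.

```python
import pandas as pd


def id_column_scores(X: pd.DataFrame, id_threshold: float = 1.0) -> dict:
    """Hypothetical helper: score columns using the three checks listed above."""
    scores = {}
    n_rows = len(X)
    for col in X.columns:
        name = str(col).lower()
        # Checks 1 and 2: the name is "id" or ends in "_id" (case-insensitive).
        name_hit = name == "id" or name.endswith("_id")

        series = X[col]
        # Check 3: every value is unique, counted only for integer or
        # categorical columns.
        typed = pd.api.types.is_integer_dtype(series) or isinstance(
            series.dtype, pd.CategoricalDtype
        )
        unique_hit = typed and n_rows > 0 and series.nunique() == n_rows

        if name_hit and unique_hit:
            scores[col] = 1.0
        elif name_hit or unique_hit:
            scores[col] = 0.95  # assumed value for a single piece of evidence

    # Keep only the columns whose score reaches the threshold.
    return {col: score for col, score in scores.items() if score >= id_threshold}
```

Under these assumptions, a unique integer column with an unremarkable name is reported only once `id_threshold` is lowered to 0.95 or below, while a unique column named like an ID is reported at the default threshold of 1.0.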
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series) – The target. Defaults to None. Ignored.
- Returns
A list of warning dictionaries describing the columns flagged as likely ID columns, in the form shown in the examples below. The list is empty if no column meets id_threshold.
- Return type
list
Examples
>>> import pandas as pd
>>> from evalml.data_checks import IDColumnsDataCheck
Columns whose names end in “_id” and whose values are all unique are likely to be ID columns.
>>> df = pd.DataFrame({
...     "customer_id": [123, 124, 125, 126, 127],
...     "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["customer_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["customer_id"], "rows": None}
...             }
...         ]
...     }
... ]
Columns named “ID” with all unique values will also be identified as ID columns.
>>> df = df.rename(columns={"customer_id": "ID"})
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'ID' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["ID"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["ID"], "rows": None}
...             }
...         ]
...     }
... ]
Despite being all unique, “Country_Rank” will not be identified as an ID column as id_threshold is set to 1.0 by default and its name doesn’t indicate that it’s an ID.
>>> df = pd.DataFrame({
...     "Country_Rank": [1, 2, 3, 4, 5],
...     "Sales": ["very high", "high", "high", "medium", "very low"]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []
However, lowering the threshold to 0.95 will cause this column to be identified as an ID column.
>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Country_Rank"], "rows": None},
...         "code": "HAS_ID_COLUMN",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["Country_Rank"], "rows": None}
...             }
...         ]
...     }
... ]
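Each warning in the examples above carries a "DROP_COL" entry under "action_options". One way to act on it is sketched below using only pandas. The helper name `columns_to_drop` is hypothetical and not part of evalml, and the `results` literal is copied from the last example rather than produced by calling the library.

```python
import pandas as pd


def columns_to_drop(results):
    """Hypothetical helper: collect columns named by DROP_COL action options."""
    columns = []
    for result in results:
        for action in result.get("action_options", []):
            if action.get("code") != "DROP_COL":
                continue
            for column in action.get("metadata", {}).get("columns") or []:
                if column not in columns:  # keep order, skip duplicates
                    columns.append(column)
    return columns


# Validation output copied from the last example above.
results = [
    {
        "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
        "data_check_name": "IDColumnsDataCheck",
        "level": "warning",
        "details": {"columns": ["Country_Rank"], "rows": None},
        "code": "HAS_ID_COLUMN",
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "IDColumnsDataCheck",
                "parameters": {},
                "metadata": {"columns": ["Country_Rank"], "rows": None},
            }
        ],
    }
]

df = pd.DataFrame({
    "Country_Rank": [1, 2, 3, 4, 5],
    "Sales": ["very high", "high", "high", "medium", "very low"],
})

# Drop every column that a DROP_COL action points at.
cleaned = df.drop(columns=columns_to_drop(results))
print(list(cleaned.columns))  # prints ['Sales']
```

Reading the columns from the action metadata rather than parsing the message string keeps the sketch independent of the message wording.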