id_columns_data_check¶
Data check that checks if any of the features are likely to be ID columns.
Module Contents¶
Classes Summary¶
IDColumnsDataCheck: Check if any of the features are likely to be ID columns.
Contents¶
class evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0)[source]¶
Check if any of the features are likely to be ID columns.
- Parameters
id_threshold (float) – The probability threshold to be considered an ID column. Defaults to 1.0.
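The threshold sets how likely a column must be judged an ID column before it is flagged. Below is a minimal sketch, assuming IDColumnsDataCheck is importable from evalml.data_checks and using invented column names, of lowering the threshold so that weaker candidates are also reported:

import pandas as pd
from evalml.data_checks import IDColumnsDataCheck

df = pd.DataFrame({
    "order_id": [100, 101, 102, 103],  # ID-like name with all-unique integer values
    "amount": [9.5, 3.2, 7.7, 1.1],    # ordinary numeric feature
})

# With the default id_threshold of 1.0, only columns scored at 100% or more
# are flagged; lowering the threshold also reports weaker candidates.
id_check = IDColumnsDataCheck(id_threshold=0.9)
results = id_check.validate(df)
print(results)  # inspect the warnings and suggested actions the check produces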
Methods
name: Return a name describing the data check.
validate: Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.
name(cls)¶
Return a name describing the data check.
validate(self, X, y=None)[source]¶
Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.
Checks performed are (see also the sketch after the example below):
- column name is “id”
- column name ends in “_id”
- column contains all unique values (and is categorical / integer type)
- Parameters
X (pd.DataFrame, np.ndarray) – The input features to check.
y (pd.Series) – The target. Defaults to None. Ignored.
- Returns
A dictionary of features with column name or index and their probability of being ID columns
- Return type
dict
Example
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'df_id': [0, 1, 2, 3, 4],
...     'x': [10, 42, 31, 51, 61],
...     'y': [42, 54, 12, 64, 12]
... })
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == {
...     "errors": [],
...     "warnings": [{"message": "Columns 'df_id' are 100.0% or more likely to be an ID column",
...                   "data_check_name": "IDColumnsDataCheck",
...                   "level": "warning",
...                   "code": "HAS_ID_COLUMN",
...                   "details": {"columns": ["df_id"], "rows": None}}],
...     "actions": [{"code": "DROP_COL",
...                  "metadata": {"columns": ["df_id"], "rows": None}}]}
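As an additional illustration of the checks listed above, here is a hedged sketch; the column names are invented, and the exact probabilities and warning messages may differ depending on how the individual heuristics combine:

import pandas as pd
from evalml.data_checks import IDColumnsDataCheck

df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],              # column name is exactly "id"
    "user_id": [7, 8, 9, 10, 11],       # column name ends in "_id"
    "code": [101, 102, 103, 104, 105],  # all-unique integer values, no ID-like name
    "price": [5, 5, 3, 8, 8],           # repeated values and a non-ID name
})

results = IDColumnsDataCheck().validate(df)
for warning in results["warnings"]:
    print(warning["message"])  # reports which columns were judged likely ID columns

Which of these columns crosses the default threshold depends on how the individual checks are weighted; the loop simply prints whatever the check reports.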