multicollinearity_data_check
Data check to check if any set of features are likely to be multicollinear.
Module Contents
Classes Summary
MulticollinearityDataCheck – Check if any set of features are likely to be multicollinear.
Contents
- class evalml.data_checks.multicollinearity_data_check.MulticollinearityDataCheck(threshold=0.9)
Check if any set of features are likely to be multicollinear.
- Parameters
threshold (float) – The mutual information score at or above which a pair of columns is flagged as potentially multicollinear. Defaults to 0.9.
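The example further down notes that correlated columns are identified using mutual information. The following is a minimal, standard-library sketch of how such a threshold could be applied to pairs of discrete columns. It is an illustration only, not evalml's implementation: the helper names `entropy`, `normalized_mutual_info`, and `flag_pairs` are hypothetical, and dividing by the mean of the two entropies is an assumed normalization.

```python
from collections import Counter
from itertools import combinations
from math import log


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def normalized_mutual_info(a, b):
    """Mutual information of two discrete sequences, scaled to [0, 1].

    Assumption: normalize by the mean of the two entropies.
    """
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    mean_entropy = (entropy(a) + entropy(b)) / 2
    # Two constant columns carry no information; report 0.0 by convention here.
    return mi / mean_entropy if mean_entropy > 0 else 0.0


def flag_pairs(columns, threshold=0.9):
    """Return column-name pairs whose score is at or above the threshold."""
    return [
        (left, right)
        for left, right in combinations(columns, 2)
        if normalized_mutual_info(columns[left], columns[right]) >= threshold
    ]


col = [1, 0, 2, 3, 4] * 15
columns = {
    "col_1": col,
    "col_2": [v * 3 for v in col],  # a relabeling of col_1, so fully dependent
    "col_3": [0, 1, 0] * 25,        # statistically independent of col_1 in this data
}
flagged = flag_pairs(columns, threshold=0.9)
```

In this sketch only the `col_1`/`col_2` pair is flagged, because `col_2` is a one-to-one relabeling of `col_1`, while `col_3` shares no information with either. Floating-point rounding can leave a score a hair below 1.0, so a hand-rolled comparison against exactly 1.0 may need a small tolerance.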
Methods
name – Return a name describing the data check.
validate – Check if any set of features are likely to be multicollinear.
- name(cls)
Return a name describing the data check.
- validate(self, X, y=None)
Check if any set of features are likely to be multicollinear.
- Parameters
X (pd.DataFrame) – The input features to check.
y (pd.Series) – The target. Ignored.
- Returns
list containing a DataCheckWarning dict if there are any potentially multicollinear columns.
- Return type
list
Example
>>> import pandas as pd
>>> from evalml.data_checks.multicollinearity_data_check import MulticollinearityDataCheck

Columns in X that are highly correlated with each other will be identified using mutual information.

>>> col = pd.Series([1, 0, 2, 3, 4] * 15)
>>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
>>> y = pd.Series([1, 0, 0, 1, 0] * 15)
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=1.0)
>>> assert multicollinearity_check.validate(X, y) == [
...     {
...         "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
...         "data_check_name": "MulticollinearityDataCheck",
...         "level": "warning",
...         "code": "IS_MULTICOLLINEAR",
...         "details": {"columns": [("col_1", "col_2")], "rows": None},
...         "action_options": []
...     }
... ]
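The structure asserted in the example above can be consumed with plain Python. The sketch below assumes the result has exactly the shape shown in that example. The helpers `correlated_pairs` and `columns_to_drop` are hypothetical and not part of evalml, and dropping the second column of each pair is only one naive policy among many.

```python
def correlated_pairs(results):
    """Collect the column pairs reported under the IS_MULTICOLLINEAR code."""
    pairs = []
    for entry in results:
        if entry.get("code") == "IS_MULTICOLLINEAR":
            pairs.extend(entry["details"]["columns"])
    return pairs


def columns_to_drop(pairs):
    """Naive policy (an assumption): keep the first column of each pair."""
    drop = []
    for _, right in pairs:
        if right not in drop:
            drop.append(right)
    return drop


# Literal copy of the result shown in the example above.
results = [
    {
        "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
        "data_check_name": "MulticollinearityDataCheck",
        "level": "warning",
        "code": "IS_MULTICOLLINEAR",
        "details": {"columns": [("col_1", "col_2")], "rows": None},
        "action_options": [],
    }
]

pairs = correlated_pairs(results)
to_drop = columns_to_drop(pairs)
```

Because the example shows an empty `action_options` list, deciding which column of a flagged pair to keep appears to be left to the caller; the policy here is only a placeholder for that decision.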