multicollinearity_data_check

Data check to check if any set of features are likely to be multicollinear.

Module Contents

Classes Summary

MulticollinearityDataCheck

Check if any set of features are likely to be multicollinear.

Contents

class evalml.data_checks.multicollinearity_data_check.MulticollinearityDataCheck(threshold=0.9)

Check if any set of features are likely to be multicollinear.

Parameters: threshold (float) – The mutual information score at or above which a pair of columns is flagged as potentially multicollinear. Defaults to 0.9.

Methods

`name`	Return a name describing the data check.
`validate`	Check if any set of features are likely to be multicollinear.

name(cls): Return a name describing the data check.

validate(self, X, y=None)

Check if any set of features are likely to be multicollinear.

Parameters

X (pd.DataFrame) – The input features to check.
y (pd.Series) – The target. Ignored.

Returns

A list of dicts containing a DataCheckWarning if there are any potentially multicollinear columns.

Return type

list[dict]
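The returned warnings can be post-processed with plain Python. The following is a minimal sketch, assuming the result has the shape shown in the Example section (a list of dicts with `code` and `details` keys). The helper `collinear_pairs` is hypothetical and not part of evalml.

```python
def collinear_pairs(results):
    """Collect column pairs from IS_MULTICOLLINEAR entries.

    Hypothetical helper for illustration; not part of evalml.
    """
    pairs = []
    for entry in results:
        if entry.get("code") == "IS_MULTICOLLINEAR":
            pairs.extend(entry["details"]["columns"])
    return pairs


# Result copied from the documented example for MulticollinearityDataCheck.
results = [
    {
        "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
        "data_check_name": "MulticollinearityDataCheck",
        "level": "warning",
        "code": "IS_MULTICOLLINEAR",
        "details": {"columns": [("col_1", "col_2")], "rows": None},
        "action_options": [],
    }
]

pairs = collinear_pairs(results)
print(pairs)  # [('col_1', 'col_2')]
```

Filtering on `code` rather than on the message text keeps the helper stable if the wording of the message changes.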

Example

>>> import pandas as pd
>>> from evalml.data_checks.multicollinearity_data_check import MulticollinearityDataCheck

Columns in X that are highly correlated with each other will be identified using mutual information.

>>> col = pd.Series([1, 0, 2, 3, 4] * 15)
>>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
>>> y = pd.Series([1, 0, 0, 1, 0] * 15)
...
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=1.0)
>>> assert multicollinearity_check.validate(X, y) == [
...     {
...         "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
...         "data_check_name": "MulticollinearityDataCheck",
...         "level": "warning",
...         "code": "IS_MULTICOLLINEAR",
...         "details": {"columns": [("col_1", "col_2")], "rows": None},
...         "action_options": []
...     }
... ]
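To see why `col_1` and `col_2` are flagged even with a strict threshold, it may help to compute a mutual information score by hand. The sketch below uses only the standard library and treats every column as discrete. Normalizing by the mean of the two entropies is an assumption made for illustration, and the function names are hypothetical; this is not evalml's implementation, which may bin or normalize values differently.

```python
from collections import Counter
from itertools import combinations
from math import log


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def normalized_mutual_info(a, b):
    """Mutual information of two discrete sequences, scaled to [0, 1].

    Normalizes by the mean of the two entropies (an illustrative
    assumption, not necessarily what evalml does).
    """
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))  # joint entropy of the paired values
    mean_h = (h_a + h_b) / 2
    if mean_h == 0:  # both sequences are constant
        return 1.0
    return (h_a + h_b - h_ab) / mean_h


col_1 = [1, 0, 2, 3, 4] * 15
columns = {
    "col_1": col_1,
    "col_2": [v * 3 for v in col_1],  # deterministic function of col_1
    # Changes only every 5 rows, so it is independent of col_1's cycle.
    "col_3": [0] * 5 + ([1] * 5 + [0] * 5) * 7,
}

threshold = 0.9  # same value as the documented default
flagged = [
    (a, b)
    for a, b in combinations(columns, 2)
    if normalized_mutual_info(columns[a], columns[b]) >= threshold
]
print(flagged)  # [('col_1', 'col_2')]
```

Because `col_2` is a one-to-one function of `col_1`, knowing one column fully determines the other and the score reaches its maximum, whereas `col_3` shares no information with either column and scores approximately zero.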