multicollinearity_data_check

Data check that identifies whether any set of features is likely to be multicollinear.

Module Contents

Classes Summary

MulticollinearityDataCheck

Check if any set of features is likely to be multicollinear.

Contents

class evalml.data_checks.multicollinearity_data_check.MulticollinearityDataCheck(threshold=0.9)[source]

Check if any set of features is likely to be multicollinear.

Parameters

threshold (float) – The mutual information score at or above which a pair of columns is flagged as likely multicollinear. Defaults to 0.9.
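As a rough illustration of how such a threshold could be applied, the sketch below scores each pair of columns with a normalized mutual information and flags pairs whose score is at or above the threshold. It is plain Python for discrete columns only. The helper names (`entropy`, `normalized_mutual_info`, `flag_pairs`) and the normalization by the larger entropy are assumptions made for this sketch, not evalml's implementation, which may compute and scale mutual information differently.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalized_mutual_info(a, b):
    """Mutual information of two discrete columns, scaled to [0, 1]."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        return 0.0  # a constant column shares no information with anything
    h_ab = entropy(list(zip(a, b)))  # joint entropy of the paired values
    mutual_info = h_a + h_b - h_ab
    # Assumption for this sketch: normalize by the larger marginal entropy.
    return mutual_info / max(h_a, h_b)


def flag_pairs(columns, threshold=0.9):
    """Return column-name pairs whose score is at or above the threshold."""
    return [
        (name_1, name_2)
        for name_1, name_2 in combinations(columns, 2)
        if normalized_mutual_info(columns[name_1], columns[name_2]) >= threshold
    ]


col = [1, 0, 2, 3, 4]
columns = {
    "col_1": col,
    "col_2": [v * 3 for v in col],  # a rescaled copy of col_1
    "col_3": [0, 0, 1, 1, 1],       # only partly determined by col_1
}
flagged = flag_pairs(columns, threshold=1.0)
print(flagged)  # [('col_1', 'col_2')]
```

With `threshold=1.0` only the pair in which one column fully determines the other is flagged. Lowering the threshold makes the check stricter about redundancy and flags partially dependent pairs as well.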

Methods

name

Return a name describing the data check.

validate

Check if any set of features is likely to be multicollinear.

name(cls)

Return a name describing the data check.

validate(self, X, y=None)[source]

Check if any set of features is likely to be multicollinear.

Parameters
  • X (pd.DataFrame) – The input features to check.

  • y (pd.Series) – The target. Ignored.

Returns

A list containing a DataCheckWarning, serialized as a dict, if there are any potentially multicollinear columns.

Return type

list

Example

>>> import pandas as pd
>>> from evalml.data_checks.multicollinearity_data_check import MulticollinearityDataCheck

Columns in X that are highly correlated with each other will be identified using mutual information.

>>> col = pd.Series([1, 0, 2, 3, 4])
>>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
>>> y = pd.Series([1, 0, 0, 1, 0])
...
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=1.0)
>>> assert multicollinearity_check.validate(X, y) == [
...     {
...         "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
...         "data_check_name": "MulticollinearityDataCheck",
...         "level": "warning",
...         "code": "IS_MULTICOLLINEAR",
...         "details": {"columns": [("col_1", "col_2")], "rows": None},
...         "action_options": []
...     }
... ]
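The warning in the example carries an empty `action_options` list, so deciding what to do with the flagged pairs is left to the caller. One possible follow-up, sketched below, is to keep the first column of each flagged pair and drop the second. The `columns_to_drop` helper is hypothetical and not part of evalml, and the result literal is copied from the example rather than produced by a live call.

```python
# Result copied from the example above; in practice it would come from
# multicollinearity_check.validate(X, y).
results = [
    {
        "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
        "data_check_name": "MulticollinearityDataCheck",
        "level": "warning",
        "code": "IS_MULTICOLLINEAR",
        "details": {"columns": [("col_1", "col_2")], "rows": None},
        "action_options": [],
    }
]


def columns_to_drop(results):
    """Hypothetical helper: for each flagged pair, drop the second column."""
    drop = []
    for result in results:
        if result["code"] != "IS_MULTICOLLINEAR":
            continue  # ignore results from other data checks
        for first, second in result["details"]["columns"]:
            # Skip pairs already resolved by an earlier drop.
            if first not in drop and second not in drop:
                drop.append(second)
    return drop


to_drop = columns_to_drop(results)
print(to_drop)  # ['col_2']
```

The resulting list can be passed to `X.drop(columns=to_drop)` on the pandas DataFrame. Which column of a pair to keep is a modeling decision, so this "drop the second" policy is only a placeholder.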