multicollinearity_data_check

Module Contents

Classes Summary

MulticollinearityDataCheck

Check if any set of features are likely to be multicollinear.

Contents

class evalml.data_checks.multicollinearity_data_check.MulticollinearityDataCheck(threshold=0.9)[source]

Check if any set of features are likely to be multicollinear.

Parameters

threshold (float) – The threshold used to decide whether a set of features is flagged as likely multicollinear. Defaults to 0.9.
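To illustrate how a threshold of this kind can turn pairwise scores into flagged column pairs, here is a minimal sketch in plain Python. It is not the library's implementation: it assumes absolute Pearson correlation as the pairwise score and a "greater than or equal to" comparison, and the helper names `pearson` and `flag_pairs` are hypothetical. The measure MulticollinearityDataCheck actually uses may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; treat it as unrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_pairs(columns, threshold=0.9):
    """Return the column-name pairs whose absolute score meets the threshold.

    Assumption: the score is |Pearson correlation| and the comparison is >=.
    """
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


columns = {
    "col_1": [1, 0, 2, 3, 4],
    "col_2": [3, 0, 6, 9, 12],  # exactly col_1 * 3
    "col_3": [2, 2, 0, 2, 2],   # uncorrelated with col_1 and col_2
}

flagged = flag_pairs(columns, threshold=0.8)
print(flagged)  # [('col_1', 'col_2')]
```

A lower threshold flags more pairs and a higher one flags fewer. In this sketch only the perfectly correlated pair is reported at 0.8.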

Methods

name

Returns a name describing the data check.

validate

Check if any set of features are likely to be multicollinear.

name(cls)

Returns a name describing the data check.

validate(self, X, y=None)[source]

Check if any set of features are likely to be multicollinear.

Parameters

  • X (pd.DataFrame) – The input features to check.
  • y (pd.Series) – The target. Ignored.

Returns

A dict with "errors", "warnings", and "actions" keys; "warnings" holds a DataCheckWarning if any columns are potentially multicollinear.

Return type

dict

Example

>>> import pandas as pd
>>> from evalml.data_checks.multicollinearity_data_check import MulticollinearityDataCheck
>>> col = pd.Series([1, 0, 2, 3, 4])
>>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
>>> y = pd.Series([1, 0, 0, 1, 0])
>>> multicollinearity_check = MulticollinearityDataCheck(threshold=0.8)
>>> assert multicollinearity_check.validate(X, y) == {
...     "errors": [],
...     "warnings": [{"message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
...                   "data_check_name": "MulticollinearityDataCheck",
...                   "level": "warning",
...                   "code": "IS_MULTICOLLINEAR",
...                   "details": {"columns": [("col_1", "col_2")]}}],
...     "actions": []}
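The dict returned by validate can be consumed with ordinary Python. The sketch below works on a literal copy of the result shown in the example above, so it runs without EvalML installed. The helper `columns_to_drop` and its policy of keeping the first column of each flagged pair are hypothetical, shown only as one possible way to act on the warning.

```python
# Literal copy of the result from the example above.
result = {
    "errors": [],
    "warnings": [
        {
            "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
            "data_check_name": "MulticollinearityDataCheck",
            "level": "warning",
            "code": "IS_MULTICOLLINEAR",
            "details": {"columns": [("col_1", "col_2")]},
        }
    ],
    "actions": [],
}


def columns_to_drop(result):
    """Collect the second column of every flagged pair as a drop candidate.

    Hypothetical policy: keep the first column of each pair, drop the second.
    """
    drop = []
    for warning in result["warnings"]:
        # Skip warnings that come from other checks or carry other codes.
        if warning["code"] != "IS_MULTICOLLINEAR":
            continue
        for _first, second in warning["details"]["columns"]:
            if second not in drop:
                drop.append(second)
    return drop


to_drop = columns_to_drop(result)
print(to_drop)  # ['col_2']
```

With a pandas DataFrame, the resulting list could then be passed to `X.drop(columns=to_drop)`. Whether dropping a column is appropriate depends on the modeling task.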