multicollinearity_data_check
=========================================================

.. py:module:: evalml.data_checks.multicollinearity_data_check

.. autoapi-nested-parse::

   Data check to check if any set of features are likely to be multicollinear.


Module Contents
---------------

Classes Summary
~~~~~~~~~~~~~~~

.. autoapisummary::

   evalml.data_checks.multicollinearity_data_check.MulticollinearityDataCheck


Contents
~~~~~~~~~~~~~~~~~~~

.. py:class:: MulticollinearityDataCheck(threshold=0.9)

   Check if any set of features are likely to be multicollinear.

   :param threshold: The threshold to be considered. Defaults to 0.9.
   :type threshold: float

   **Methods**

   .. autoapisummary::
      :nosignatures:

      evalml.data_checks.multicollinearity_data_check.MulticollinearityDataCheck.name
      evalml.data_checks.multicollinearity_data_check.MulticollinearityDataCheck.validate

   .. py:method:: name(cls)

      Return a name describing the data check.

   .. py:method:: validate(self, X, y=None)

      Check if any set of features are likely to be multicollinear.

      :param X: The input features to check.
      :type X: pd.DataFrame
      :param y: The target. Ignored.
      :type y: pd.Series

      :returns: dict with a DataCheckWarning if there are any potentially multicollinear columns.
      :rtype: dict

      .. rubric:: Example

      >>> import pandas as pd
      >>> from evalml.data_checks import MulticollinearityDataCheck

      Columns in X that are highly correlated with each other will be identified using mutual information.

      >>> col = pd.Series([1, 0, 2, 3, 4] * 15)
      >>> X = pd.DataFrame({"col_1": col, "col_2": col * 3})
      >>> y = pd.Series([1, 0, 0, 1, 0] * 15)
      ...
      >>> multicollinearity_check = MulticollinearityDataCheck(threshold=1.0)
      >>> assert multicollinearity_check.validate(X, y) == [
      ...     {
      ...         "message": "Columns are likely to be correlated: [('col_1', 'col_2')]",
      ...         "data_check_name": "MulticollinearityDataCheck",
      ...         "level": "warning",
      ...         "code": "IS_MULTICOLLINEAR",
      ...         "details": {"columns": [("col_1", "col_2")], "rows": None},
      ...         "action_options": []
      ...     }
      ... ]
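
To illustrate how a mutual-information threshold can flag column pairs, here is a minimal sketch that uses only pandas and NumPy. It is not evalml's implementation. The helpers ``normalized_mutual_info`` and ``find_collinear_pairs`` are hypothetical names introduced for this example. The normalization (dividing by the smaller of the two column entropies) and the at-or-above comparison against the threshold are assumptions, and the sketch handles discrete columns only.

.. code-block:: python

   from itertools import combinations

   import numpy as np
   import pandas as pd


   def normalized_mutual_info(a: pd.Series, b: pd.Series) -> float:
       """Mutual information of two discrete columns, scaled to [0, 1].

       Hypothetical helper: the scaling divides by the smaller column
       entropy, which is an assumption made for this sketch.
       """
       # Joint distribution p(a, b) as a 2-D array of cell probabilities.
       joint = pd.crosstab(a, b, normalize=True).to_numpy()
       p_a = joint.sum(axis=1, keepdims=True)  # marginal of a, shape (r, 1)
       p_b = joint.sum(axis=0, keepdims=True)  # marginal of b, shape (1, c)
       mask = joint > 0  # skip empty cells to avoid log(0)
       mi = float((joint[mask] * np.log(joint[mask] / (p_a @ p_b)[mask])).sum())
       h_a = float(-(p_a[p_a > 0] * np.log(p_a[p_a > 0])).sum())
       h_b = float(-(p_b[p_b > 0] * np.log(p_b[p_b > 0])).sum())
       denom = min(h_a, h_b)
       # A constant column has zero entropy and carries no information.
       return mi / denom if denom > 0 else 0.0


   def find_collinear_pairs(X: pd.DataFrame, threshold: float = 0.9) -> list:
       """Return column pairs whose normalized mutual information is at
       or above ``threshold`` (the comparison is an assumption)."""
       return [
           (left, right)
           for left, right in combinations(X.columns, 2)
           if normalized_mutual_info(X[left], X[right]) >= threshold
       ]


   col = pd.Series([1, 0, 2, 3, 4] * 15)
   X = pd.DataFrame({"col_1": col, "col_2": col * 3, "col_3": [0, 1, 0] * 25})
   print(find_collinear_pairs(X, threshold=0.9))  # [('col_1', 'col_2')]

Here ``col_2`` is a one-to-one function of ``col_1``, so the pair scores close to 1.0 and is flagged, while ``col_3`` is statistically independent of both and scores close to 0. The sketch uses a threshold of 0.9 rather than 1.0 so that floating-point rounding in the score cannot affect the result.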