Using Text Data with EvalML¶
In this demo, we will show you how to use EvalML to build models which use text data.
[1]:
import evalml
from evalml import AutoMLSearch
Dataset¶
We will use a dataset of SMS text messages, some labeled as spam and the rest as not spam (“ham”). The dataset originally comes from Kaggle and has been modified to give a slightly more even ratio of spam to ham.
[2]:
from urllib.request import urlopen
import pandas as pd
input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')
data = pd.read_csv(input_data)
X = data.drop(['Category'], axis=1)
y = data['Category']
display(X.head())
| | Message |
---|---|
0 | Free entry in 2 a wkly comp to win FA Cup fina... |
1 | FreeMsg Hey there darling it's been 3 week's n... |
2 | WINNER!! As a valued network customer you have... |
3 | Had your mobile 11 months or more? U R entitle... |
4 | SIX chances to win CASH! From 100 to 20,000 po... |
The ratio of ham to spam in the data is 3:1, so a trivial baseline model that classifies every message as ham already reaches 75% accuracy. Any useful machine learning model must score above that.
[3]:
y.value_counts(normalize=True)
[3]:
ham 0.750084
spam 0.249916
Name: Category, dtype: float64
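As a rough check on the 75% figure, the accuracy of a model that always predicts the most frequent class equals that class's share of the labels. The sketch below uses a small hypothetical label list with the same 3:1 ratio rather than the real dataset, and `majority_baseline_accuracy` is an illustrative helper, not an EvalML function.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical labels with the same 3:1 ham-to-spam ratio as the dataset.
toy_labels = ["ham"] * 3 + ["spam"] * 1
baseline_accuracy = majority_baseline_accuracy(toy_labels)
print(baseline_accuracy)
```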
Search for best pipeline¶
In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.
[4]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=0.2, random_seed=0)
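For intuition about what a holdout split for a classification problem should preserve, here is a minimal pure-Python sketch of a stratified split, which samples each class separately so that the holdout set keeps the original class ratio. It is written under the assumption that a classification split is stratified; `stratified_holdout_split` is a hypothetical helper and does not describe how `split_data` is implemented.

```python
import random
from collections import defaultdict


def stratified_holdout_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, holdout_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, holdout = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * test_size)
        holdout.extend(indices[:n_holdout])
        train.extend(indices[n_holdout:])
    return sorted(train), sorted(holdout)


# Hypothetical labels with a 3:1 ham-to-spam ratio.
toy_labels = ["ham"] * 75 + ["spam"] * 25
train_idx, holdout_idx = stratified_holdout_split(toy_labels)
holdout_spam_share = sum(toy_labels[i] == "spam" for i in holdout_idx) / len(holdout_idx)
print(len(train_idx), len(holdout_idx), holdout_spam_share)
```

Because each class contributes the same fraction of its rows, the spam share in the holdout set matches the share in the full label list.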
EvalML uses Woodwork to automatically detect which columns are text columns, so you can run search normally, as you would if there were no text data. We can print the logical type of the Message column and confirm that it is indeed inferred as a natural language column.
[5]:
X_train.ww
[5]:
Column | Physical Type | Logical Type | Semantic Tag(s) |
---|---|---|---|
Message | string | NaturalLanguage | [] |
Because the spam/ham labels are binary, we will use AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary'). When we call .search(), the search for the best pipeline will begin.
[6]:
automl = AutoMLSearch(X_train=X_train, y_train=y_train,
                      problem_type='binary',
                      max_batches=1,
                      optimize_thresholds=True)
automl.search()
Generating pipelines to search over...
8 pipelines ready for search.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of 9 pipelines.
Allowed model families: xgboost, lightgbm, random_forest, linear_model, extra_trees, decision_tree, catboost
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 8.638
*****************************
* Evaluating Batch Number 1 *
*****************************
Elastic Net Classifier w/ Text Featurization Component + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.222
High coefficient of variation (cv >= 0.2) within cross validation scores.
Elastic Net Classifier w/ Text Featurization Component + Standard Scaler may not perform as estimated on unseen data.
Decision Tree Classifier w/ Text Featurization Component:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 1.403
High coefficient of variation (cv >= 0.2) within cross validation scores.
Decision Tree Classifier w/ Text Featurization Component may not perform as estimated on unseen data.
Random Forest Classifier w/ Text Featurization Component:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.153
High coefficient of variation (cv >= 0.2) within cross validation scores.
Random Forest Classifier w/ Text Featurization Component may not perform as estimated on unseen data.
LightGBM Classifier w/ Text Featurization Component:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.213
High coefficient of variation (cv >= 0.2) within cross validation scores.
LightGBM Classifier w/ Text Featurization Component may not perform as estimated on unseen data.
Logistic Regression Classifier w/ Text Featurization Component + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.222
High coefficient of variation (cv >= 0.2) within cross validation scores.
Logistic Regression Classifier w/ Text Featurization Component + Standard Scaler may not perform as estimated on unseen data.
XGBoost Classifier w/ Text Featurization Component:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.172
High coefficient of variation (cv >= 0.2) within cross validation scores.
XGBoost Classifier w/ Text Featurization Component may not perform as estimated on unseen data.
Extra Trees Classifier w/ Text Featurization Component:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.256
CatBoost Classifier w/ Text Featurization Component:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.534
Search finished after 01:00
Best pipeline: Random Forest Classifier w/ Text Featurization Component
Best pipeline Log Loss Binary: 0.152939
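The baseline's large Log Loss Binary value can be understood with a small worked example. Log loss heavily penalizes confident wrong predictions, and a mode baseline predicts ham with certainty for every message, so each spam message incurs the maximum penalty. The sketch below uses hypothetical labels with a 25% spam share and assumes probabilities are clipped at 1e-15 (an assumption about the metric's implementation, not something stated in this demo); under that assumption the result lands close to the 8.638 logged for the Mode Baseline.

```python
import math


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for label, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # assumed clipping constant
        total += -math.log(p) if label == 1 else -math.log(1 - p)
    return total / len(y_true)


# Hypothetical labels: 1 = spam (25%), 0 = ham (75%).
y_true = [0, 0, 0, 1] * 250
# A mode baseline predicts ham with certainty, so P(spam) = 0 for every message.
baseline_loss = binary_log_loss(y_true, [0.0] * len(y_true))
print(round(baseline_loss, 3))
```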
View rankings and select pipeline¶
Once the fitting process is done, we can see all of the pipelines that were searched.
[7]:
automl.rankings
[7]:
| | id | pipeline_name | search_order | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | Random Forest Classifier w/ Text Featurization... | 3 | 0.152939 | 0.043389 | 0.113306 | 98.229529 | True | {'Random Forest Classifier': {'n_estimators': ... |
1 | 6 | XGBoost Classifier w/ Text Featurization Compo... | 6 | 0.171712 | 0.046956 | 0.120741 | 98.012206 | True | {'XGBoost Classifier': {'eta': 0.1, 'max_depth... |
2 | 4 | LightGBM Classifier w/ Text Featurization Comp... | 4 | 0.212777 | 0.056598 | 0.147567 | 97.536822 | True | {'LightGBM Classifier': {'boosting_type': 'gbd... |
3 | 1 | Elastic Net Classifier w/ Text Featurization C... | 1 | 0.221790 | 0.045718 | 0.172468 | 97.432481 | True | {'Elastic Net Classifier': {'penalty': 'elasti... |
4 | 5 | Logistic Regression Classifier w/ Text Featuri... | 5 | 0.221831 | 0.045594 | 0.172700 | 97.432008 | True | {'Logistic Regression Classifier': {'penalty':... |
5 | 7 | Extra Trees Classifier w/ Text Featurization C... | 7 | 0.255506 | 0.050486 | 0.224186 | 97.042171 | False | {'Extra Trees Classifier': {'n_estimators': 10... |
6 | 8 | CatBoost Classifier w/ Text Featurization Comp... | 8 | 0.533608 | 0.014352 | 0.523532 | 93.822767 | False | {'CatBoost Classifier': {'n_estimators': 10, '... |
7 | 2 | Decision Tree Classifier w/ Text Featurization... | 2 | 1.403436 | 0.553454 | 0.775794 | 83.753347 | True | {'Decision Tree Classifier': {'criterion': 'gi... |
8 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 8.638305 | 0.025020 | 8.623860 | 0.000000 | False | {'Baseline Classifier': {'strategy': 'mode'}} |
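The percent_better_than_baseline values in this table are consistent with a simple relative-improvement formula for a lower-is-better objective. The sketch below reproduces the Random Forest entry from the two mean_cv_score values shown in the table; the formula is inferred from the table values, not taken from the EvalML source, and the helper name is illustrative.

```python
def percent_better_than_baseline(score, baseline_score):
    """Relative improvement for a lower-is-better objective, in percent."""
    return 100 * (baseline_score - score) / baseline_score


baseline = 8.638305       # Mode Baseline mean_cv_score from the table
random_forest = 0.152939  # Random Forest mean_cv_score from the table
improvement = percent_better_than_baseline(random_forest, baseline)
print(round(improvement, 2))
```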
To select the best pipeline we can call automl.best_pipeline.
[8]:
best_pipeline = automl.best_pipeline
Describe pipeline¶
You can get more details about any pipeline, including how it performed on other objective functions.
[9]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
************************************************************
* Random Forest Classifier w/ Text Featurization Component *
************************************************************
Problem Type: binary
Model Family: Random Forest
Pipeline Steps
==============
1. Text Featurization Component
2. Random Forest Classifier
* n_estimators : 100
* max_depth : 6
* n_jobs : -1
Training
========
Training for binary problems.
Total training time (including CV): 7.6 seconds
Cross Validation
----------------
Log Loss Binary MCC Binary AUC Precision F1 Balanced Accuracy Binary Accuracy Binary # Training # Validation
0 0.113 0.899 0.987 0.896 0.925 0.959 0.961 1,594 797
1 0.146 0.838 0.978 0.887 0.878 0.916 0.940 1,594 797
2 0.199 0.789 0.968 0.848 0.841 0.892 0.921 1,594 797
mean 0.153 0.842 0.978 0.877 0.881 0.923 0.941 - -
std 0.043 0.055 0.009 0.026 0.042 0.034 0.020 - -
coef of var 0.284 0.066 0.010 0.029 0.047 0.037 0.021 - -
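The "coef of var" row and the high-variance warnings in the search log can be checked by hand. The sketch below takes the three per-fold Log Loss Binary values from the table, computes the sample standard deviation divided by the mean, and compares the result with the 0.2 threshold quoted in the log. Using the sample (n - 1) standard deviation is an assumption that happens to match the rounded values in the table.

```python
from statistics import mean, stdev

# Per-fold Log Loss Binary values from the cross validation table.
fold_log_loss = [0.113, 0.146, 0.199]

cv_mean = mean(fold_log_loss)
cv_std = stdev(fold_log_loss)          # sample standard deviation (n - 1)
coef_of_var = cv_std / cv_mean
is_high_variance = coef_of_var >= 0.2  # threshold quoted in the search log

print(round(cv_mean, 3), round(cv_std, 3), round(coef_of_var, 3), is_high_variance)
```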
[10]:
best_pipeline.graph()
[10]:
Notice above that there is a Text Featurization Component as the first step in the pipeline. Woodwork identifies 'Message' in the data passed to AutoMLSearch as a text column, and the Text Featurization Component converts this text into numerical values that the estimator can handle.
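To make "converting text into numerical values" concrete, here is a toy sketch that maps a message to a handful of numeric features. It does not reproduce what the Text Featurization Component computes; the features, the helper name `toy_text_features`, and the example message are all hypothetical and chosen only for illustration.

```python
def toy_text_features(message):
    """Map one message to a few numeric features (illustrative choices only)."""
    words = message.split()
    letters = [ch for ch in message if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "num_chars": len(message),
        "num_words": len(words),
        "num_digits": sum(ch.isdigit() for ch in message),
        "upper_ratio": upper_ratio,
        "num_exclamations": message.count("!"),
    }


# Hypothetical spam-like message, not taken from the dataset.
features = toy_text_features("WINNER!! Call 0800 now")
print(features)
```

Once every message is a fixed-length row of numbers like this, any of the estimators in the search can be trained on it.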
Evaluate on holdout¶
Now, we can score the pipeline on the holdout data using the core objectives for binary classification problems.
[11]:
scores = best_pipeline.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')
Accuracy Binary: 0.9715719063545151
As you can see, the model reaches about 97% accuracy on the holdout set, well above the 75% baseline, so it generalizes to unseen data.
Why encode text this way?¶
To demonstrate the importance of text-specific modeling, let’s train a model with the same dataset, without letting AutoMLSearch detect the text column. We can do this by explicitly setting the data type of the 'Message' column in Woodwork to Categorical using the utility method infer_feature_types.
[12]:
from evalml.utils import infer_feature_types
X = infer_feature_types(X, {'Message': 'Categorical'})
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=0.2, random_seed=0)
[13]:
automl_no_text = AutoMLSearch(X_train=X_train, y_train=y_train,
                              problem_type='binary',
                              max_batches=1,
                              optimize_thresholds=True)
automl_no_text.search()
Generating pipelines to search over...
8 pipelines ready for search.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of 9 pipelines.
Allowed model families: xgboost, lightgbm, random_forest, linear_model, extra_trees, decision_tree, catboost
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 8.638
*****************************
* Evaluating Batch Number 1 *
*****************************
Elastic Net Classifier w/ Imputer + One Hot Encoder + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.561
Decision Tree Classifier w/ Imputer + One Hot Encoder:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.561
Random Forest Classifier w/ Imputer + One Hot Encoder:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.561
LightGBM Classifier w/ Imputer + One Hot Encoder:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.562
Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.561
XGBoost Classifier w/ Imputer + One Hot Encoder:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.562
Extra Trees Classifier w/ Imputer + One Hot Encoder:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.561
CatBoost Classifier w/ Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.631
Search finished after 00:06
Best pipeline: Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler
Best pipeline Log Loss Binary: 0.560995
As before, we can look at the rankings and pick the best pipeline.
[14]:
automl_no_text.rankings
[14]:
| | id | pipeline_name | search_order | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters |
---|---|---|---|---|---|---|---|---|---|
0 | 5 | Logistic Regression Classifier w/ Imputer + On... | 5 | 0.560995 | 0.001805 | 0.560347 | 93.505723 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
1 | 1 | Elastic Net Classifier w/ Imputer + One Hot En... | 1 | 0.561000 | 0.001804 | 0.560350 | 93.505674 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
2 | 2 | Decision Tree Classifier w/ Imputer + One Hot ... | 2 | 0.561199 | 0.002170 | 0.560353 | 93.503373 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
3 | 3 | Random Forest Classifier w/ Imputer + One Hot ... | 3 | 0.561386 | 0.001712 | 0.560692 | 93.501207 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
4 | 7 | Extra Trees Classifier w/ Imputer + One Hot En... | 7 | 0.561418 | 0.001676 | 0.560625 | 93.500837 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
5 | 6 | XGBoost Classifier w/ Imputer + One Hot Encoder | 6 | 0.562262 | 0.000769 | 0.561991 | 93.491057 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
6 | 4 | LightGBM Classifier w/ Imputer + One Hot Encoder | 4 | 0.562452 | 0.000798 | 0.561991 | 93.488867 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
7 | 8 | CatBoost Classifier w/ Imputer | 8 | 0.631214 | 0.004016 | 0.633875 | 92.692846 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
8 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 8.638305 | 0.025020 | 8.623860 | 0.000000 | False | {'Baseline Classifier': {'strategy': 'mode'}} |
[15]:
best_pipeline_no_text = automl_no_text.best_pipeline
Here, changing the data type of the text column removed the Text Featurization Component from the pipeline.
[16]:
best_pipeline_no_text.graph()
[16]:
[17]:
automl_no_text.describe_pipeline(automl_no_text.rankings.iloc[0]["id"])
*********************************************************************************
* Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler *
*********************************************************************************
Problem Type: binary
Model Family: Linear
Pipeline Steps
==============
1. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* categorical_fill_value : None
* numeric_fill_value : None
2. One Hot Encoder
* top_n : 10
* features_to_encode : None
* categories : None
* drop : if_binary
* handle_unknown : ignore
* handle_missing : error
3. Standard Scaler
4. Logistic Regression Classifier
* penalty : l2
* C : 1.0
* n_jobs : -1
* multi_class : auto
* solver : lbfgs
Training
========
Training for binary problems.
Total training time (including CV): 0.6 seconds
Cross Validation
----------------
Log Loss Binary MCC Binary AUC Precision F1 Balanced Accuracy Binary Accuracy Binary # Training # Validation
0 0.560 0.000 0.503 0.000 0.000 0.500 0.750 1,594 797
1 0.560 0.000 0.504 0.000 0.000 0.500 0.750 1,594 797
2 0.563 0.000 0.502 0.000 0.000 0.500 0.749 1,594 797
mean 0.561 0.000 0.503 0.000 0.000 0.500 0.750 - -
std 0.002 0.000 0.001 0.000 0.000 0.000 0.001 - -
coef of var 0.003 inf 0.003 inf inf 0.000 0.001 - -
[18]:
# get standard performance metrics on holdout data
scores = best_pipeline_no_text.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')
Accuracy Binary: 0.7508361204013378
Without the Text Featurization Component, the 'Message' column was treated as a categorical column, so the conversion of the text to numerical features happened in the One Hot Encoder. The best pipeline encoded only the 10 most frequent “categories” of these texts, meaning that 10 distinct text messages were one-hot encoded and all the others were dropped. This removed almost all of the information from the dataset: best_pipeline_no_text scores 75.1% accuracy on the holdout set, essentially the same as always predicting “ham”.
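The information loss described above can be illustrated with a small pure-Python sketch. It one-hot encodes only the `top_n` most frequent values of a column and gives every other value an all-zero row. The helper `one_hot_top_n` and the generated messages are hypothetical, and this is a simplified stand-in for the idea, not the One Hot Encoder's implementation.

```python
from collections import Counter


def one_hot_top_n(values, top_n):
    """One-hot encode only the top_n most frequent values; others get all zeros."""
    kept = [value for value, _ in Counter(values).most_common(top_n)]
    return [[int(value == category) for category in kept] for value in values]


# Hypothetical messages: every text is unique, as is nearly the case for real SMS data.
messages = [f"message number {i}" for i in range(100)]
encoded = one_hot_top_n(messages, top_n=10)
all_zero_rows = sum(1 for row in encoded if not any(row))
print(all_zero_rows)
```

When nearly every message is unique, all but `top_n` rows become identical zero vectors, so the estimator has nothing to distinguish spam from ham and falls back to the majority class.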