# Using Text Data with EvalML

In this demo, we will show you how to use EvalML to build models which use text data.

[1]:

import evalml
from evalml import AutoMLSearch


## Dataset

We will be utilizing a dataset of SMS text messages, some of which are categorized as spam, and others which are not (“ham”). This dataset is originally from Kaggle, but modified to produce a slightly more even distribution of spam to ham.

[2]:

from urllib.request import urlopen
import pandas as pd

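input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')
# Load the CSV into a DataFrame; the outputs in this demo correspond to the first 750 rows
data = pd.read_csv(input_data)[:750]

# Separate the features (the message text) from the target (spam/ham label)
X = data.drop(['Category'], axis=1)
y = data['Category']

display(X.head())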


Message
0 Free entry in 2 a wkly comp to win FA Cup fina...
1 FreeMsg Hey there darling it's been 3 week's n...
2 WINNER!! As a valued network customer you have...
4 SIX chances to win CASH! From 100 to 20,000 po...

The spam vs ham distribution of the data is roughly 3:2 (about 59% spam and 41% ham), so any machine learning model must get above 59% accuracy in order to perform better than a trivial baseline model which simply classifies everything as spam.

[3]:

y.value_counts(normalize=True)

[3]:

spam    0.593333
ham     0.406667
Name: Category, dtype: float64
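The baseline accuracy follows directly from these class proportions. As a minimal sketch, the snippet below builds a hypothetical label list (445 spam and 305 ham out of 750 rows, an assumption chosen only because it reproduces the proportions printed above) and computes the accuracy of always predicting the most common class.

```python
from collections import Counter

# Hypothetical labels that reproduce the printed proportions
# (445 / 750 = 0.593333 spam, 305 / 750 = 0.406667 ham).
labels = ["spam"] * 445 + ["ham"] * 305

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A model that always predicts the majority class is correct
# exactly as often as that class occurs.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 4))
```

Any model worth keeping should beat this number on held-out data.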


To use Woodwork’s NaturalLanguage logical type, we need to specify it when initializing Woodwork on the data. Otherwise, the column will be treated as an Unknown type and dropped in the search.

[4]:

X.ww.init(logical_types={"Message": "NaturalLanguage"})


## Search for best pipeline

In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.

[5]:

X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=0.2, random_seed=0)
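To see how the holdout split relates to the fold sizes reported later in the cross-validation table, here is a small arithmetic sketch. The row count of 750 and the three folds are assumptions inferred from the outputs in this demo, not values read from EvalML itself.

```python
n_rows = 750      # assumption: inferred from the outputs in this demo
test_size = 0.2   # same value passed to split_data above
n_folds = 3       # assumption: three folds, as in the cross-validation table

# Rows set aside for the holdout set versus rows available to the search
n_holdout = round(n_rows * test_size)
n_train = n_rows - n_holdout

# Within the search, each fold validates on one third of the training rows
n_validation_per_fold = n_train // n_folds
n_training_per_fold = n_train - n_validation_per_fold

print(n_train, n_holdout, n_training_per_fold, n_validation_per_fold)
```

Under these assumptions, the search trains on 400 rows and validates on 200 rows in each fold, while 150 rows stay untouched for the final evaluation.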


EvalML uses Woodwork's typing information to determine which columns are text columns, so you can run search normally, as you would if there was no text data. We can print out the logical type of the Message column and confirm that it is indeed typed as a natural language column.

[6]:

X_train.ww

[6]:

Physical Type Logical Type Semantic Tag(s)
Column
Message string NaturalLanguage []

Because the spam/ham labels are binary, we will use AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary'). When we call .search(), the search for the best pipeline will begin.

[7]:

automl = AutoMLSearch(X_train=X_train, y_train=y_train,
                      problem_type='binary',
                      max_batches=1,
                      optimize_thresholds=True)

automl.search()

Generating pipelines to search over...

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of 9 pipelines.
Allowed model families: extra_trees, xgboost, decision_tree, random_forest, catboost, lightgbm, linear_model


Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 14.046

*****************************
* Evaluating Batch Number 1 *
*****************************

Elastic Net Classifier w/ Text Featurization Component + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.350
High coefficient of variation (cv >= 0.2) within cross validation scores.
Elastic Net Classifier w/ Text Featurization Component + Imputer + Standard Scaler may not perform as estimated on unseen data.
Decision Tree Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 3.386
Random Forest Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.221
LightGBM Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.292
High coefficient of variation (cv >= 0.2) within cross validation scores.
LightGBM Classifier w/ Text Featurization Component + Imputer may not perform as estimated on unseen data.
Logistic Regression Classifier w/ Text Featurization Component + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.350
High coefficient of variation (cv >= 0.2) within cross validation scores.
Logistic Regression Classifier w/ Text Featurization Component + Imputer + Standard Scaler may not perform as estimated on unseen data.
[22:17:38] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[22:17:39] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[22:17:40] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBoost Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.266
High coefficient of variation (cv >= 0.2) within cross validation scores.
XGBoost Classifier w/ Text Featurization Component + Imputer may not perform as estimated on unseen data.
Extra Trees Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.292
CatBoost Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.589

Search finished after 00:26
Best pipeline: Random Forest Classifier w/ Text Featurization Component + Imputer
Best pipeline Log Loss Binary: 0.221422
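The baseline log loss of about 14.05 in the log above may look surprisingly large. A plausible explanation, sketched below, is that the mode baseline predicts the majority class with probability 1, so every minority-class row is penalized with the largest loss the metric allows. The clipping constant `1e-15` and the 445/305 class counts are assumptions for illustration, and `log_loss` here is a hand-written helper rather than the function EvalML calls.

```python
import math

EPS = 1e-15  # assumed clipping constant for predicted probabilities


def log_loss(y_true, p_pred, eps=EPS):
    """Mean binary cross-entropy with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical data: 1 = spam (majority class), 0 = ham
y_true = [1] * 445 + [0] * 305
# The mode baseline predicts "spam" with probability 1 for every row
p_pred = [1.0] * len(y_true)

baseline_log_loss = log_loss(y_true, p_pred)
print(round(baseline_log_loss, 2))
```

Each misclassified row contributes roughly `-log(1e-15)`, about 34.5, and around 41% of the rows are misclassified, which lands close to the value reported for the baseline pipeline.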


### View rankings and select pipeline

Once the fitting process is done, we can see all of the pipelines that were searched.

[8]:

automl.rankings

[8]:

id pipeline_name search_order mean_cv_score standard_deviation_cv_score validation_score percent_better_than_baseline high_variance_cv parameters
0 3 Random Forest Classifier w/ Text Featurization... 3 0.221422 0.040958 0.221587 98.423568 False {'Imputer': {'categorical_impute_strategy': 'm...
1 6 XGBoost Classifier w/ Text Featurization Compo... 6 0.266164 0.106501 0.242896 98.105025 True {'Imputer': {'categorical_impute_strategy': 'm...
2 4 LightGBM Classifier w/ Text Featurization Comp... 4 0.291768 0.114862 0.291521 97.922737 True {'Imputer': {'categorical_impute_strategy': 'm...
3 7 Extra Trees Classifier w/ Text Featurization C... 7 0.292373 0.029893 0.325764 97.918427 False {'Imputer': {'categorical_impute_strategy': 'm...
4 5 Logistic Regression Classifier w/ Text Featuri... 5 0.350340 0.074833 0.349271 97.505728 True {'Imputer': {'categorical_impute_strategy': 'm...
5 1 Elastic Net Classifier w/ Text Featurization C... 1 0.350471 0.074886 0.349437 97.504795 True {'Imputer': {'categorical_impute_strategy': 'm...
6 8 CatBoost Classifier w/ Text Featurization Comp... 8 0.588944 0.004016 0.592259 95.806967 False {'Imputer': {'categorical_impute_strategy': 'm...
7 2 Decision Tree Classifier w/ Text Featurization... 2 3.385551 0.672118 3.708759 75.896294 False {'Imputer': {'categorical_impute_strategy': 'm...
8 0 Mode Baseline Binary Classification Pipeline 0 14.045769 0.099705 13.988204 0.000000 False {'Baseline Classifier': {'strategy': 'mode'}}
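The `percent_better_than_baseline` column can be reproduced from the `mean_cv_score` column. The formula below is an assumption (relative improvement over the baseline for a lower-is-better objective), but it matches the values in the table.

```python
def percent_better_than_baseline(score, baseline_score):
    """Relative improvement over the baseline for a lower-is-better objective."""
    return 100 * (baseline_score - score) / baseline_score


baseline = 14.045769  # mean_cv_score of the Mode Baseline pipeline

random_forest = percent_better_than_baseline(0.221422, baseline)
decision_tree = percent_better_than_baseline(3.385551, baseline)

print(round(random_forest, 3), round(decision_tree, 3))
```

Because log loss is unbounded above, even the weakest real pipeline in the rankings shows a large percentage improvement over the baseline.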

To select the best pipeline we can call automl.best_pipeline.

[9]:

best_pipeline = automl.best_pipeline


### Describe pipeline

You can get more details about any pipeline, including how it performed on other objective functions.

[10]:

automl.describe_pipeline(automl.rankings.iloc[0]["id"])


**********************************************************************
* Random Forest Classifier w/ Text Featurization Component + Imputer *
**********************************************************************

Problem Type: binary
Model Family: Random Forest

Pipeline Steps
==============
1. Text Featurization Component
2. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* categorical_fill_value : None
* numeric_fill_value : None
3. Random Forest Classifier
* n_estimators : 100
* max_depth : 6
* n_jobs : -1

Training
========
Training for binary problems.
Total training time (including CV): 3.1 seconds

Cross Validation
----------------
Log Loss Binary  MCC Binary  Gini   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary # Training # Validation
0                      0.222       0.817 0.950 0.975      0.862 0.893                     0.913            0.910        400          200
1                      0.180       0.875 0.970 0.985      0.937 0.925                     0.936            0.940        400          200
2                      0.262       0.783 0.925 0.963      0.918 0.865                     0.883            0.895        400          200
mean                   0.221       0.825 0.948 0.974      0.906 0.894                     0.910            0.915          -            -
std                    0.041       0.047 0.023 0.011      0.039 0.030                     0.026            0.023          -            -
coef of var            0.185       0.057 0.024 0.012      0.043 0.034                     0.029            0.025          -            -
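The `coef of var` row is the standard deviation of the fold scores divided by their mean, and it is what the "High coefficient of variation (cv >= 0.2)" warning in the search log refers to. Here is a minimal sketch using the three Log Loss Binary fold scores from the table, assuming the sample standard deviation is used.

```python
import statistics

# Log Loss Binary for each of the three folds, taken from the table above
fold_scores = [0.222, 0.180, 0.262]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation (assumption)
coef_of_var = std_score / mean_score

# Threshold taken from the "cv >= 0.2" warning in the search log
HIGH_VARIANCE_THRESHOLD = 0.2
high_variance = coef_of_var >= HIGH_VARIANCE_THRESHOLD

print(round(coef_of_var, 3), high_variance)
```

The Random Forest pipeline stays just under the threshold, which is consistent with its `high_variance_cv` value of False in the rankings.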

[11]:

best_pipeline.graph()

[11]:


Notice above that there is a Text Featurization Component as the first step in the pipeline. AutoMLSearch uses the Woodwork typing information on the data to recognize that 'Message' is a text column, and converts this text into numerical values that can be handled by the estimator.

## Evaluate on holdout

Now, we can score the pipeline on the holdout data using the core objectives for binary classification problems.

[12]:

scores = best_pipeline.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')

Accuracy Binary: 0.96


As you can see, this model reaches 96% accuracy on the holdout set, so it generalizes well to unseen data and clearly outperforms a trivial majority-class baseline.

## What does the Text Featurization Component do?

Most machine learning models cannot handle non-numeric data directly, so any text must be broken down into numeric features that provide useful information about that text. The Text Featurization Component first normalizes your text by removing any punctuation and other non-alphanumeric characters and converting any capital letters to lowercase. From there, it passes the text into featuretools' nlp_primitives DFS (Deep Feature Synthesis) search, resulting in several informative features that replace the original column in your dataset: Diversity Score, Mean Characters per Word, Polarity Score, and LSA (Latent Semantic Analysis).

Diversity Score is the ratio of unique words to total words.

Mean Characters per Word is the average number of letters in each word.

Polarity Score is a prediction of how “polarized” the text is, on a scale from -1 (extremely negative) to 1 (extremely positive).

Latent Semantic Analysis is an abstract representation of how important each word is with respect to the entire text, reduced down into two values per text. While the other text features are each a single column, this feature adds two columns to your data, LSA(column_name)[0] and LSA(column_name)[1].
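To make the two simplest features concrete, here is a rough pure-Python approximation of the normalization step, Diversity Score, and Mean Characters per Word. These helper functions are hypothetical simplifications written for this illustration; they are not the nlp_primitives implementations, which may tokenize text differently.

```python
import re


def normalize(text):
    """Lowercase the text and drop punctuation and other non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())


def diversity_score(text):
    """Ratio of unique words to total words."""
    words = normalize(text).split()
    return len(set(words)) / len(words) if words else 0.0


def mean_characters_per_word(text):
    """Average number of characters in each word."""
    words = normalize(text).split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


message = "What happened to our yo date?"
print(diversity_score(message), round(mean_characters_per_word(message), 6))
```

For this message, all six words are distinct, so the diversity score is 1.0, and the 23 characters spread over six words give a mean of about 3.83 characters per word.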

Let’s see what this looks like with our spam/ham example.

[13]:

best_pipeline.input_feature_names

[13]:

{'Text Featurization Component': ['Message'],
'Imputer': ['DIVERSITY_SCORE(Message)',
'MEAN_CHARACTERS_PER_WORD(Message)',
'POLARITY_SCORE(Message)',
'LSA(Message)[0]',
'LSA(Message)[1]'],
'Random Forest Classifier': ['DIVERSITY_SCORE(Message)',
'MEAN_CHARACTERS_PER_WORD(Message)',
'POLARITY_SCORE(Message)',
'LSA(Message)[0]',
'LSA(Message)[1]']}


Here, the Text Featurization Component takes in a single “Message” column, but the next component in the pipeline, the Imputer, receives five columns of input. These five columns are the result of featurizing the text-type “Message” column. Most importantly, these featurized columns are what is ultimately passed to the estimator.

If the dataset had any non-text columns, those would be left alone by this process. If the dataset had more than one text column, each would be broken into these five feature columns independently.
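The column bookkeeping described above can be sketched as follows. The feature name templates are copied from the `input_feature_names` output, while the `featurized_columns` helper, the extra column names (`sender_id`, `Subject`), and the ordering of the result are hypothetical and only serve the illustration.

```python
# Name patterns taken from the input_feature_names output above
FEATURE_TEMPLATES = [
    "DIVERSITY_SCORE({col})",
    "MEAN_CHARACTERS_PER_WORD({col})",
    "POLARITY_SCORE({col})",
    "LSA({col})[0]",
    "LSA({col})[1]",
]


def featurized_columns(columns, text_columns):
    """Return column names after each text column is replaced by its five features."""
    # Non-text columns pass through unchanged (output ordering is an assumption)
    result = [c for c in columns if c not in text_columns]
    for col in columns:
        if col in text_columns:
            result.extend(t.format(col=col) for t in FEATURE_TEMPLATES)
    return result


# Hypothetical dataset with one non-text column and two text columns
columns = ["sender_id", "Subject", "Message"]
new_columns = featurized_columns(columns, text_columns={"Subject", "Message"})
print(len(new_columns))
```

With two text columns and one non-text column, the estimator would see eleven input columns: the untouched `sender_id` plus five features for each text column.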

### The features, more directly

Rather than just checking the new column names, let’s examine the output of this component directly. We can see this by running the component on its own.

[14]:

text_featurizer = evalml.pipelines.components.TextFeaturizer()
X_featurized = text_featurizer.fit_transform(X_train)


Now we can compare the input data to the output from the text featurizer:

[15]:

X_train.head()

[15]:

Message
296 Sunshine Hols. To claim ur med holiday send a ...
652 Yup ü not comin :-(
526 Hello hun how ru? Its here by the way. Im good...
571 I tagged MY friends that you seemed to count a...
472 What happened to our yo date?

[16]:

X_featurized.head()

[16]:

DIVERSITY_SCORE(Message) MEAN_CHARACTERS_PER_WORD(Message) POLARITY_SCORE(Message) LSA(Message)[0] LSA(Message)[1]
296 1.0 4.344828 0.003 0.150556 -0.072443
652 1.0 3.000000 0.000 0.017340 -0.005411
526 1.0 3.363636 0.162 0.169954 0.022670
571 0.8 4.083333 0.681 0.144713 0.036799
472 1.0 3.833333 0.000 0.109373 -0.042754

These numeric values now represent important information about the original text that the estimator at the end of the pipeline can successfully use to make predictions.

## Why encode text this way?

To demonstrate the importance of text-specific modeling, let’s train a model with the same dataset, without letting AutoMLSearch detect the text column. We can change this by explicitly setting the data type of the 'Message' column in Woodwork to Categorical using the utility method infer_feature_types.

[17]:

from evalml.utils import infer_feature_types
X = infer_feature_types(X, {'Message': 'Categorical'})
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=0.2, random_seed=0)

[18]:

automl_no_text = AutoMLSearch(X_train=X_train, y_train=y_train,
                              problem_type='binary',
                              max_batches=1,
                              optimize_thresholds=True)

automl_no_text.search()

Generating pipelines to search over...

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of 9 pipelines.
Allowed model families: extra_trees, xgboost, decision_tree, random_forest, catboost, lightgbm, linear_model


Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 14.046

*****************************
* Evaluating Batch Number 1 *
*****************************

Elastic Net Classifier w/ Text Featurization Component + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.350
High coefficient of variation (cv >= 0.2) within cross validation scores.
Elastic Net Classifier w/ Text Featurization Component + Imputer + Standard Scaler may not perform as estimated on unseen data.
Decision Tree Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 3.386
Random Forest Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.221
LightGBM Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.292
High coefficient of variation (cv >= 0.2) within cross validation scores.
LightGBM Classifier w/ Text Featurization Component + Imputer may not perform as estimated on unseen data.
Logistic Regression Classifier w/ Text Featurization Component + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.350
High coefficient of variation (cv >= 0.2) within cross validation scores.
Logistic Regression Classifier w/ Text Featurization Component + Imputer + Standard Scaler may not perform as estimated on unseen data.
[22:18:04] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[22:18:05] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[22:18:06] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBoost Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.266
High coefficient of variation (cv >= 0.2) within cross validation scores.
XGBoost Classifier w/ Text Featurization Component + Imputer may not perform as estimated on unseen data.
Extra Trees Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.292
CatBoost Classifier w/ Text Featurization Component + Imputer:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.589

Search finished after 00:23
Best pipeline: Random Forest Classifier w/ Text Featurization Component + Imputer
Best pipeline Log Loss Binary: 0.221422


Like before, we can look at the rankings and pick the best pipeline.

[19]:

automl_no_text.rankings

[19]:

id pipeline_name search_order mean_cv_score standard_deviation_cv_score validation_score percent_better_than_baseline high_variance_cv parameters
0 3 Random Forest Classifier w/ Text Featurization... 3 0.221422 0.040958 0.221587 98.423568 False {'Imputer': {'categorical_impute_strategy': 'm...
1 6 XGBoost Classifier w/ Text Featurization Compo... 6 0.266164 0.106501 0.242896 98.105025 True {'Imputer': {'categorical_impute_strategy': 'm...
2 4 LightGBM Classifier w/ Text Featurization Comp... 4 0.291768 0.114862 0.291521 97.922737 True {'Imputer': {'categorical_impute_strategy': 'm...
3 7 Extra Trees Classifier w/ Text Featurization C... 7 0.292373 0.029893 0.325764 97.918427 False {'Imputer': {'categorical_impute_strategy': 'm...
4 5 Logistic Regression Classifier w/ Text Featuri... 5 0.350340 0.074833 0.349271 97.505728 True {'Imputer': {'categorical_impute_strategy': 'm...
5 1 Elastic Net Classifier w/ Text Featurization C... 1 0.350471 0.074886 0.349437 97.504795 True {'Imputer': {'categorical_impute_strategy': 'm...
6 8 CatBoost Classifier w/ Text Featurization Comp... 8 0.588944 0.004016 0.592259 95.806967 False {'Imputer': {'categorical_impute_strategy': 'm...
7 2 Decision Tree Classifier w/ Text Featurization... 2 3.385551 0.672118 3.708759 75.896294 False {'Imputer': {'categorical_impute_strategy': 'm...
8 0 Mode Baseline Binary Classification Pipeline 0 14.045769 0.099705 13.988204 0.000000 False {'Baseline Classifier': {'strategy': 'mode'}}

[20]:

best_pipeline_no_text = automl_no_text.best_pipeline


Changing the data type of the text column is intended to remove the Text Featurization Component from the pipeline, which we can check in the pipeline graph and description.

[21]:

best_pipeline_no_text.graph()

[21]:

[22]:

automl_no_text.describe_pipeline(automl_no_text.rankings.iloc[0]["id"])


**********************************************************************
* Random Forest Classifier w/ Text Featurization Component + Imputer *
**********************************************************************

Problem Type: binary
Model Family: Random Forest

Pipeline Steps
==============
1. Text Featurization Component
2. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* categorical_fill_value : None
* numeric_fill_value : None
3. Random Forest Classifier
* n_estimators : 100
* max_depth : 6
* n_jobs : -1

Training
========
Training for binary problems.
Total training time (including CV): 3.1 seconds

Cross Validation
----------------
Log Loss Binary  MCC Binary  Gini   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary # Training # Validation
0                      0.222       0.817 0.950 0.975      0.862 0.893                     0.913            0.910        400          200
1                      0.180       0.875 0.970 0.985      0.937 0.925                     0.936            0.940        400          200
2                      0.262       0.783 0.925 0.963      0.918 0.865                     0.883            0.895        400          200
mean                   0.221       0.825 0.948 0.974      0.906 0.894                     0.910            0.915          -            -
std                    0.041       0.047 0.023 0.011      0.039 0.030                     0.026            0.023          -            -
coef of var            0.185       0.057 0.024 0.012      0.043 0.034                     0.029            0.025          -            -

[23]:

# get standard performance metrics on holdout data
scores = best_pipeline_no_text.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')

Accuracy Binary: 0.96


When the 'Message' column is typed as Categorical, the Text Featurization Component is skipped and the conversion of this text to numerical features happens in the One Hot Encoder instead. That encoder keeps only the 10 most frequent “categories”, meaning 10 text messages are one-hot encoded and all the others are dropped. This removes almost all of the information from the dataset, so such a pipeline should perform about as well as a model that always guesses the majority class. The results recorded above do not show this effect: the rankings, the pipeline description (which still lists the Text Featurization Component), and the 0.96 holdout accuracy all match the first search, which indicates that the Categorical override did not take effect in this run.
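The information loss from one-hot encoding free text can be illustrated with a small sketch. The `top_n_one_hot` helper and the generated messages below are hypothetical; this is a simplified stand-in for a top-N one-hot encoder, not EvalML's One Hot Encoder.

```python
from collections import Counter


def top_n_one_hot(values, top_n=10):
    """One-hot encode only the top_n most frequent values; all others become all-zero rows."""
    kept = [value for value, _ in Counter(values).most_common(top_n)]
    return [[1 if value == k else 0 for k in kept] for value in values]


# Hypothetical training messages: free text is almost always unique per row
messages = [f"message number {i}" for i in range(600)]

encoded = top_n_one_hot(messages)
all_zero_rows = sum(1 for row in encoded if not any(row))

print(len(encoded[0]), all_zero_rows)
```

With 600 unique messages and only 10 encoded categories, 590 rows end up as identical all-zero vectors, leaving the estimator with nothing to distinguish spam from ham.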