Spam Detection with SMS Data

This project is about detecting spam short message service (SMS) messages, in order to practice building machine learning models and to play around with text data. The data I will be looking at is from Kaggle.

The steps I will be taking for this project are:

1. Explore the data
2. Clean the text
3. Build and evaluate models (sklearn and TensorFlow)
4. Try the models on unseen data

What does the data look like?

All the columns are object data types, which means we most likely have text (string) data in each column.
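A minimal sketch of the load-and-inspect step, assuming the Kaggle CSV is saved locally as spam.csv (this dataset is typically latin-1 encoded):

```python
import pandas as pd

# "spam.csv" and the encoding are assumptions about the local setup.
df = pd.read_csv("spam.csv", encoding="latin-1")
print(df.dtypes)   # every column shows as "object" (strings)
print(df.head())
```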

To check if there are any null (empty) values in the columns:
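Something along these lines, with pandas:

```python
# Number of null values in each column.
print(df.isnull().sum())
```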

The columns Unnamed: 2, Unnamed: 3, and Unnamed: 4 have a large number of null values (the proportion of null to non-null values is large). The description of the dataset (from Kaggle) says that v1 is the classification and v2 is the SMS, so it makes sense that the other columns are mostly null. I wonder why they were included at all?

Since v1 and v2 are the only columns we need, we can rename them to class and msg for clarity.
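A sketch of the rename, which also drops the mostly-null columns:

```python
# Keep only the useful columns and give them clearer names.
df = df[["v1", "v2"]].rename(columns={"v1": "class", "v2": "msg"})
```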

Note that the ham class has a lot more entries than spam, which means that this is an imbalanced dataset.
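A quick check of the imbalance:

```python
# Ham far outnumbers spam in this dataset.
print(df["class"].value_counts())
print(df["class"].value_counts(normalize=True))
```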

Creating a word cloud of both spam and ham messages will help us visualise the difference between the two groups.
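A sketch using the third-party wordcloud package (my assumption; any word-frequency plot would work):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# One cloud per class, built from all messages of that class.
for label in ["spam", "ham"]:
    text = " ".join(df.loc[df["class"] == label, "msg"])
    wc = WordCloud(width=600, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"{label} messages")
plt.show()
```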

Spam SMS have words like "txt", "call", "free", "mobile", ... while common ham words are "will", "u", "now", "go", ... The two sets share some words ("u", "now", "call", ...), but hopefully the difference between spam and ham messages is big enough that the models can tell them apart most of the time.

Let's start cleaning the data! The goal is to create the smallest set of unique words that will still identify whether an SMS is spam. Punctuation and whitespace will be removed: whitespace will not help us classify the SMS, and "Hello!!!" means the same as "Hello" (you could argue punctuation carries some signal, but we want the smallest set possible). Stop words, which are commonly used words (a, the, is, are) that make a sentence coherent but carry little meaning for classification, will be removed along with any numbers.

I will also try to reduce words to a root form, using stemming (PorterStemmer) and lemmatization (WordNetLemmatizer).
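A minimal cleaning sketch with NLTK (assumes the stopwords and wordnet corpora have been downloaded via nltk.download):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_msg(msg):
    msg = msg.lower()
    msg = re.sub(r"[^a-z\s]", " ", msg)            # drop punctuation and digits
    words = [w for w in msg.split() if w not in stop_words]
    words = [stemmer.stem(w) for w in words]       # or: lemmatizer.lemmatize(w)
    return " ".join(words)

df["clean_msg"] = df["msg"].apply(clean_msg)
```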

Creating the Models

For this dataset we have two classes: spam or ham.

The models I will be testing are:

From sklearn:

- Multinomial Naive Bayes
- Support Vector Machine (SVM)
- Logistic Regression

From TensorFlow:

- Long Short-Term Memory (LSTM)

First I will gather the data into the features to be trained on (X) and the corresponding classifications (y). I will then split the data into training and validation sets (80% for training, 20% for validation). The training set is what the model uses to "learn", and the validation set checks how the model does on data it has not seen. The training and validation sets should be disjoint (no SMS in common).
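A sketch of the split; stratifying on the label (my choice here) keeps the spam/ham ratio similar in both sets:

```python
from sklearn.model_selection import train_test_split

X = df["clean_msg"]
y = df["class"]

# 80/20 split; stratify so both sets keep the same class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```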

For the sklearn models I will create a pipeline, shared by all of them, that first converts the words into counts (CountVectorizer), then weighs each word based on the SMS it appears in relative to the whole dataset (TfidfTransformer), and finally applies the classification method for each model.
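One way to express the shared pipeline is a small helper that takes the classifier as a parameter (make_pipeline below is my own name for it):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def make_pipeline(clf):
    return Pipeline([
        ("vect", CountVectorizer()),    # words -> token counts
        ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
        ("clf", clf),                   # the classification method
    ])
```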

Multinomial Naive Bayes

The first model I will use is the Multinomial Naive Bayes model.

The reason I think this model could make good predictions is that Naive Bayes models are based on the probabilities of a set of words given the classification.

The MultiNB model is ~95% accurate on the validation data, but overall accuracy does not tell the whole story! With a classification report we can see the precision and recall of the model.
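Fitting and scoring, using the make_pipeline helper sketched above:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb_model = make_pipeline(MultinomialNB())
nb_model.fit(X_train, y_train)

y_pred = nb_model.predict(X_val)
print(nb_model.score(X_val, y_val))          # overall accuracy
print(classification_report(y_val, y_pred))  # per-class precision/recall
```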

Personally, I would rather a ham SMS never be marked as spam, and I would be okay with a few spam SMS making their way through the model. The ham predictions have a recall of 100%, which means every ham SMS was predicted correctly (yay!).

This is the heatmap of the predictions on the validation set. I think the heatmap gives a good visualization of the predictions since we only have 2 classes.
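One way to draw it, with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_val, y_pred, labels=["ham", "spam"])
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["ham", "spam"], yticklabels=["ham", "spam"])
plt.xlabel("predicted")
plt.ylabel("actual")
plt.show()
```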

Support Vector Machine

An SVM tries to "draw" a line or curve that separates the classes; which side of the line a data point falls on determines the prediction. This is easy to visualize in 2 dimensions, but our data has far more than 3 dimensions, which is very difficult to show.
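Swapping the classifier in the same pipeline; I use sklearn's SVC here, though the exact SVM variant and kernel are assumptions:

```python
from sklearn.svm import SVC

svm_model = make_pipeline(SVC())   # default RBF kernel
svm_model.fit(X_train, y_train)
print(svm_model.score(X_val, y_val))
```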

The SVM model has an accuracy of ~97.9%, which is higher than MultiNB, but what about the precision and recall?

The recall of the ham class went down to 99%, meaning a ham SMS now has a 99% chance of being predicted correctly. Even though the overall accuracy went up, recall on ham SMS is no longer perfect. Let's see which ham SMS were marked as spam.
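Pulling those messages out of the validation set:

```python
# Messages that are actually ham but were predicted as spam.
y_pred = svm_model.predict(X_val)
false_spam = X_val[(y_val == "ham") & (y_pred == "spam")]
print(false_spam.tolist())
```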

We can see that the ham messages marked as spam have words like "new", "chance", "text", "message", "reply". These are words you would expect in a spam SMS.

Logistic Regression
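The same pipeline pattern again (raising max_iter so the solver converges is my own tweak):

```python
from sklearn.linear_model import LogisticRegression

lr_model = make_pipeline(LogisticRegression(max_iter=1000))
lr_model.fit(X_train, y_train)
print(lr_model.score(X_val, y_val))
```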

The Logistic Regression model's overall accuracy, ~96.2%, is lower than the SVM model's but slightly better than the MultiNB model's. We can see that recall on ham SMS is 100% (yay!) and the other statistics are slightly better than MultiNB's!

TensorFlow

I will be testing a Long Short-Term Memory (LSTM) network. I think this would be a good fit because the data is a sequence of words (a sentence), and I believe earlier words in the sequence relate to later words when predicting whether the SMS is spam.

Setting up the data is similar to the sklearn models but we no longer have the nice pipeline.

I first count the number of unique words in the cleaned set of words. This is used by the tokenizer, which converts the features into sequences by assigning a number to each word in the feature set. We will also need to make sure the sequences fed to the model are all the same length (padding with 0's), so let's look at a rough plot of how the word counts of the cleaned SMS are distributed.
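A sketch of both steps, counting the vocabulary and plotting the length distribution:

```python
import matplotlib.pyplot as plt

all_words = df["clean_msg"].str.split()
vocab_size = len(set(w for words in all_words for w in words))
lengths = all_words.str.len()   # words per cleaned SMS

plt.hist(lengths, bins=40)
plt.xlabel("words per cleaned SMS")
plt.ylabel("count")
plt.show()
```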

The distribution is right skewed.

When choosing a length for the sequences to be padded to: if we used the largest message length (89 words), most sequences would be mostly useless padding. If we used the median, we would truncate half of the SMS when converting them into sequences. So I think the best sequence length is one that covers 99% of the messages, which is 33.
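The percentile itself can be computed directly:

```python
import numpy as np

# 99th percentile of message lengths -- the padding length (~33 here).
max_len = int(np.percentile(lengths, 99))
```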

I will also convert the classifications to 0 or 1 (ham and spam respectively).

First create the tokenizer from the words in the cleaned SMS.

Separate into training and validation sets

Confirm that we have about the same proportion of spam and ham messages in each set.

Then turn the features into sequences

Then pad the sequence with 0's to make sure all sequences are equal length (length of 33)

Note: 0 is not assigned to any word by the tokenizer, and about 1% of the SMS will be truncated.
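A sketch covering the steps above, with names carried over from earlier blocks:

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

labels = (df["class"] == "spam").astype(int)   # ham -> 0, spam -> 1

# The tokenizer assigns an integer (starting at 1) to each word.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["clean_msg"])

X_train_txt, X_val_txt, y_train_nn, y_val_nn = train_test_split(
    df["clean_msg"], labels, test_size=0.2, random_state=42, stratify=labels
)

# Convert to integer sequences, then zero-pad/truncate to max_len (33).
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train_txt), maxlen=max_len)
X_val_seq = pad_sequences(tokenizer.texts_to_sequences(X_val_txt), maxlen=max_len)
```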

Now we can make the TensorFlow model.
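One possible architecture; the layer sizes and epoch count are my assumptions, not the notebook's original values:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size + 1, output_dim=32),
    tf.keras.layers.LSTM(32, dropout=0.2),            # dropout is the stochastic part
    tf.keras.layers.Dense(1, activation="sigmoid"),   # outputs P(spam)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X_train_seq, y_train_nn, epochs=10,
                    validation_data=(X_val_seq, y_val_nn))
```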

To get the classification report, we need to do a bit of work to convert the predictions into the format that classification_report expects.
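Thresholding the predicted probabilities at 0.5:

```python
from sklearn.metrics import classification_report

probs = model.predict(X_val_seq)              # shape (n, 1) of P(spam)
preds = (probs > 0.5).astype(int).ravel()     # 0 = ham, 1 = spam
print(classification_report(y_val_nn, preds, target_names=["ham", "spam"]))
```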

Running this model will give slightly different outcomes because of the stochastic nature of the dropout in the LSTM layer, but on the last run of this notebook we got an overall accuracy of ~98% and a ham recall of 100%.

Looking at the loss and accuracy plots, we can see that the model does better on the training set than on the validation set. I think the difference is okay and the model is not overfitting, because the gap between training and validation is not that big.
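A sketch of how those plots can be drawn from the Keras history object:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["loss"], label="train")
ax1.plot(history.history["val_loss"], label="validation")
ax1.set_title("loss")
ax1.legend()
ax2.plot(history.history["accuracy"], label="train")
ax2.plot(history.history["val_accuracy"], label="validation")
ax2.set_title("accuracy")
ax2.legend()
plt.show()
```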

Conclusion

Overall, I think I learned a lot about handling text data, and I feel a bit more comfortable implementing models for prediction.

But before I end this notebook let's try some data that the models have not seen before!
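A sketch with two made-up messages, run through both kinds of model (the message texts are hypothetical):

```python
new_sms = [
    "Congratulations! You have won a free prize, call now to claim it!",
    "Hey, are we still on for lunch tomorrow?",
]
cleaned = [clean_msg(s) for s in new_sms]

# The sklearn pipeline predicts the string labels directly.
print(nb_model.predict(cleaned))

# The LSTM needs the same tokenize-and-pad treatment as the training data.
seqs = pad_sequences(tokenizer.texts_to_sequences(cleaned), maxlen=max_len)
print((model.predict(seqs) > 0.5).astype(int).ravel())   # 1 = spam
```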