Email Spam Detection Using Python & Machine Learning


Email spam, also called junk email, consists of unsolicited messages sent in bulk by email (spamming). The name comes from Spam luncheon meat by way of a Monty Python sketch in which Spam is ubiquitous, unavoidable, and repetitive.

In this article I will show you how to create your very own program to detect email spam using machine learning, natural language processing, and the Python programming language!

If you prefer not to read this post and would like a video version of it, you can check out my YouTube video. It goes through everything in this article in a little more detail and will help make it easy for you to start programming your own email spam detection program, even if you don’t have Python installed on your computer. Or you can use both as supplementary materials for learning!

Programming

The first thing that I like to do before writing a single line of code is to put in a description in comments of what the code does. This way I can look back on my code and know exactly what it does.

Import the libraries

Load the data and print the first 5 rows.
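Here is a minimal sketch of these two steps, assuming pandas is installed. The real CSV file is not reproduced in this article, so the sketch reads a small made-up sample from an in-memory string instead. With the real file you would pass its path (for example a hypothetical emails.csv) to pd.read_csv.

```python
import io

import pandas as pd

# Made-up sample standing in for the real CSV file (assumption: the real
# file has the same two columns, 'text' and 'spam', where 1 means spam).
sample_csv = io.StringIO(
    "text,spam\n"
    "Subject: win a free prize now,1\n"
    "Subject: meeting agenda for monday,0\n"
    "Subject: cheap loans click here,1\n"
    "Subject: project report attached,0\n"
    "Subject: claim your free cash offer,1\n"
    "Subject: lunch on friday,0\n"
)

# Load the data and print the first 5 rows.
df = pd.read_csv(sample_csv)
print(df.head(5))
```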

Output: the first 5 rows of data.

Let’s explore the data and get the number of rows & columns.
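A short sketch of this step, using a few made-up rows in place of the real data set. The shape attribute of a pandas DataFrame is a (rows, columns) tuple.

```python
import pandas as pd

# Made-up rows standing in for the real data set (assumption).
df = pd.DataFrame(
    {
        "text": ["win a free prize now", "meeting agenda for monday", "lunch on friday"],
        "spam": [1, 0, 0],
    }
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"Number of rows: {n_rows}, Number of columns: {n_cols}")
```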

Output: number of rows: 5728, number of columns: 2.

Get the column names in the data set.
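One way to do this, sketched on a tiny made-up DataFrame, is to read the columns attribute.

```python
import pandas as pd

# Made-up rows standing in for the real data set (assumption).
df = pd.DataFrame({"text": ["win a free prize now", "lunch on friday"], "spam": [1, 0]})

# df.columns holds the column labels; list() turns them into a plain list.
column_names = list(df.columns)
print(column_names)
```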

Output: the column names are ‘text’ & ‘spam’.

Check for duplicates and remove them.

Show the new number of rows and columns after any duplicates have been removed.
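The duplicate check and removal can be sketched like this. The three rows are made up, and one of them deliberately repeats another so that there is something to remove.

```python
import pandas as pd

# Made-up rows (assumption); the third row repeats the first one.
df = pd.DataFrame(
    {
        "text": ["win a free prize now", "lunch on friday", "win a free prize now"],
        "spam": [1, 0, 1],
    }
)

# Count the rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Keep only the first occurrence of each row, then show the new shape.
df = df.drop_duplicates()
print(df.shape)
```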

Output: number of rows: 5695, number of columns: 2 (33 duplicate rows were removed).

Show the number of missing values in each column.
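A minimal sketch of the missing-value count. The sample rows are made up, and one text value is deliberately left empty so that the count is not all zeros.

```python
import pandas as pd

# Made-up rows (assumption); the second message is missing on purpose.
df = pd.DataFrame(
    {
        "text": ["win a free prize now", None, "lunch on friday"],
        "spam": [1, 0, 0],
    }
)

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```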


Download the stop words. In natural language processing, stop words are very common words (such as “the”, “is” and “and”) that carry little useful information for a task like this one.

Create a function to clean the text and return the tokens. The text is cleaned by first removing punctuation and then removing the stop words.

The process of splitting text into tokens is known as tokenization. Show the tokenization of the first 5 rows of text data from our data set by applying the function process_text.
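Here is a sketch of what such a process_text function could look like. It assumes the stop words come from NLTK's English list, which has to be downloaded once beforehand with nltk.download('stopwords'). If NLTK or its data is not available, the sketch falls back to a tiny hand-written list, which is only there so the example runs anywhere.

```python
import string

# Prefer NLTK's English stop-word list (assumption: it was downloaded
# beforehand). Otherwise fall back to a tiny hand-written list.
try:
    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    STOP_WORDS = {"a", "an", "and", "is", "of", "the", "this", "to", "you", "your"}


def process_text(text):
    """Remove punctuation and stop words, then return the remaining tokens."""
    # 1. Drop every punctuation character.
    no_punctuation = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Split on whitespace and drop stop words (case-insensitive check).
    return [word for word in no_punctuation.split() if word.lower() not in STOP_WORDS]


tokens = process_text("Hello!!! This is a test, to the point.")
print(tokens)
```

On a DataFrame, the same function can be applied to the first 5 rows with df['text'].head().apply(process_text).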


Convert the text into a matrix of token counts.
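A small sketch of this step with scikit-learn's CountVectorizer, using a cleaning function as the analyzer. The three messages and the short stop-word list are made up for illustration; a full list would normally come from a library such as NLTK.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer

# Tiny hand-written stop-word list (assumption, for illustration only).
STOP_WORDS = {"a", "at", "is", "the"}


def process_text(text):
    """Remove punctuation and stop words, then return the remaining tokens."""
    no_punctuation = "".join(ch for ch in text if ch not in string.punctuation)
    return [word for word in no_punctuation.split() if word.lower() not in STOP_WORDS]


# Made-up messages standing in for the 'text' column.
messages = ["free prize, free!", "meeting at noon", "free meeting"]

# Each row is one message, each column one token, each cell a count.
vectorizer = CountVectorizer(analyzer=process_text)
messages_bow = vectorizer.fit_transform(messages)

print(vectorizer.vocabulary_)  # token -> column index
print(messages_bow.toarray())
```

The result is a sparse matrix, so toarray() is only practical for small examples like this one.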

Split the data into training & testing sets, and print them. We will hold the testing set back and use it later on to make predictions and see whether they match the actual values.

The testing feature (independent) data set will be stored in X_test and the testing target (dependent) data set will be stored in y_test.

The training feature (independent) data set will be stored in X_train and the training target (dependent) data set will be stored in y_train.
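The split can be sketched with scikit-learn's train_test_split. The ten messages are made up, and the 80/20 split (test_size=0.20) and random_state=0 are assumptions chosen for the example. The sketch uses CountVectorizer's default tokenizer rather than a custom analyzer to stay short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up messages and labels (1 = spam, 0 = not spam).
messages = [
    "win free prize now", "free cash offer click now",
    "claim your free prize", "cheap loans click here",
    "urgent offer win cash", "meeting agenda for monday",
    "project report attached", "lunch on friday",
    "please review the attached report", "notes from the monday meeting",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Turn the text into a matrix of token counts.
messages_bow = CountVectorizer().fit_transform(messages)

# Hold back 20% of the rows for testing (assumed split size).
X_train, X_test, y_train, y_test = train_test_split(
    messages_bow, labels, test_size=0.20, random_state=0
)

print(X_train.shape, X_test.shape)
print(len(y_train), len(y_test))
```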

Get the shape of the data.


Create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features (e.g., word counts for text classification).

Print the classifier’s predictions and the actual values on the training data set.
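Training the classifier and comparing its predictions with the actual labels can be sketched as follows. The eight training messages are made up, and the two classes share no words, so a perfect match here says nothing about how the model behaves on real email.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages and labels (1 = spam, 0 = not spam).
train_messages = [
    "win free prize now", "free cash offer click now",
    "claim your free prize", "cheap loans click here",
    "meeting agenda for monday", "project report attached",
    "lunch on friday", "please review the attached report",
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

# Turn the text into token counts (default tokenizer, for brevity).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# Create and train the Multinomial Naive Bayes classifier.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predicted values first, actual values second.
predicted = classifier.predict(X_train)
print(predicted)
print(y_train)
```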

Output: predicted values (top) and actual values (bottom).

See how well the model performed by evaluating the Naive Bayes classifier and showing the report, confusion matrix & accuracy score.
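To show how the three metrics are produced and read, here is a sketch with scikit-learn's metrics functions. The actual and predicted labels are hypothetical and written out by hand, with two deliberate mistakes so the confusion matrix is not trivial.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels (1 = spam, 0 = not spam), not real model output.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 0, 1, 1]

# Precision, recall and F1 score for each class.
print(classification_report(actual, predicted))

# Rows are actual classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(actual, predicted)
print("Confusion matrix:\n", cm)

# Fraction of messages that were classified correctly.
accuracy = accuracy_score(actual, predicted)
print("Accuracy:", accuracy)
```

In the confusion matrix, the diagonal holds the correctly classified messages and the off-diagonal cells hold the mistakes: here one normal email flagged as spam and one spam email that was missed.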

Output: the metrics report, followed by the confusion matrix and accuracy score.

It looks like the model / classifier is 99.71% accurate on the training data. Let’s test the model / classifier on the test data set (X_test & y_test) by printing the predicted values and the actual values to see if the model can accurately classify the email text/message.

Output: a sample of the predicted and actual values.

Evaluate the model on the test data set.
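The whole train-then-test flow can be sketched end to end like this. The messages, the 25% test size, random_state=0 and the stratify option (used so that the tiny test set contains both classes) are all assumptions for the example, so the accuracy it prints is not comparable to the figures in this article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up messages and labels (1 = spam, 0 = not spam).
messages = [
    "win free prize now", "free cash offer click now",
    "claim your free prize", "cheap loans click here",
    "meeting agenda for monday", "project report attached",
    "lunch on friday", "please review the attached report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorize, then split into training and testing sets.
X = CountVectorizer().fit_transform(messages)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

# Train on the training set only.
classifier = MultinomialNB().fit(X_train, y_train)

# Predict on data the classifier has not seen, then compare.
test_predictions = classifier.predict(X_test)
print("Predicted:", test_predictions)
print("Actual:   ", y_test)

test_accuracy = accuracy_score(y_test, test_predictions)
print("Test accuracy:", test_accuracy)
```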


The classifier accurately identified the email messages as spam or not spam with 99.2% accuracy on the test data!

Conclusion and Resources

That is it, you are done creating your email spam detection program!

Again, if you want, you can watch and listen to me explain all of the code in my YouTube video.

If you are interested in reading about machine learning to immediately get started with problems and examples, I recommend you read Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.

It is a great book for helping beginners learn to write machine-learning programs and understanding machine-learning concepts.

Thanks for reading this article. I hope it’s helpful to you!
