Email Spam Detection Using Python & Machine Learning


Email spam, also called junk email, consists of unsolicited messages sent in bulk by email (spamming). The name comes from Spam luncheon meat, by way of a Monty Python sketch in which Spam is ubiquitous, unavoidable, and repetitive.

In this article I will show you how to create your very own program to detect email spam using natural language processing (NLP) techniques and the Python programming language!

If you prefer not to read this post, you can check out my YouTube video instead. It goes through everything in this article in a little more detail and will help you start programming your own email spam detection program even if you don't have Python installed on your computer. Or you can use both as supplementary materials for learning!

Programming

The first thing I like to do before writing a single line of code is to add a comment describing what the code does. This way I can look back at my code and know exactly what it does.

Description: This program detects if an email is spam (1) or not (0)

Import the libraries

#Import libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

Load the data and print the first 5 rows.

#Load the data
#from google.colab import files # Use to load data on Google Colab
#uploaded = files.upload() # Use to load data on Google Colab
df = pd.read_csv('emails.csv')
df.head(5)
The first 5 rows of data

Let’s explore the data and get the number of rows & columns.

#Print the shape (Get the number of rows and cols)
df.shape
Number of rows: 5728, Number of columns: 2

Get the column names in the data set.

#Get the column names
df.columns
The column names ‘text’ & ‘spam’

Check for duplicates and remove them.

#Checking for duplicates and removing them
df.drop_duplicates(inplace = True)
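To see what drop_duplicates does, here is a minimal sketch on a tiny made-up DataFrame (illustrative data only, not the real email data set):

```python
import pandas as pd

# Toy DataFrame with one duplicated row (made-up data for illustration)
toy = pd.DataFrame({'text': ['hello there', 'hello there', 'buy now!'],
                    'spam': [0, 0, 1]})
toy.drop_duplicates(inplace=True)
print(toy.shape)  # (2, 2) -- the duplicate row was removed
```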

Show the new number of rows and columns (if there were any duplicates, the row count will have dropped).

#Show the new shape (number of rows & columns)
df.shape
Number of rows: 5695, Number of columns: 2

Show the number of missing data for each column.

#Show the number of missing (NAN, NaN, na) data for each column
df.isnull().sum()
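As a quick illustration of how isnull().sum() counts missing values per column, here is a sketch on a tiny made-up DataFrame (not the real data set, which has no missing values):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with one missing value in each column (made-up data)
toy = pd.DataFrame({'text': ['hi', None, 'hello'],
                    'spam': [0, 1, np.nan]})
print(toy.isnull().sum())  # text: 1, spam: 1
```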

Download the stop words. In natural language processing, stop words are common words (such as "the", "a", and "is") that carry little useful information for classification.

#Need to download stopwords
nltk.download('stopwords')

Create a function to clean the text and return the tokens. The cleaning of the text can be done by first removing punctuation and then removing the stop words.

#Tokenization (a list of tokens), will be used as the analyzer
#1.Punctuation characters are [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
#2.Stop words in natural language processing are common words that carry little information.
def process_text(text):

    #1 Remove punctuation
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    #2 Remove stop words
    clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

    #3 Return a list of clean words
    return clean_words
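To see what this function does step by step, here is a self-contained sketch that uses a small hard-coded stop-word set instead of the full NLTK list (an assumption made so the example runs without downloading anything):

```python
import string

# Small stand-in for the NLTK English stop-word list (assumption: just a
# few common words, for illustration only)
STOP_WORDS = {'the', 'a', 'is', 'to', 'and', 'you'}

def process_text_demo(text):
    # 1. Remove punctuation characters
    nopunc = ''.join(ch for ch in text if ch not in string.punctuation)
    # 2. Remove stop words (case-insensitive) and return the clean tokens
    return [w for w in nopunc.split() if w.lower() not in STOP_WORDS]

print(process_text_demo("Congratulations, you won a FREE prize!"))
# ['Congratulations', 'won', 'FREE', 'prize']
```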

The process of splitting text into tokens is known as tokenization. Show the tokenization of the first 5 rows of text data from our data set by applying the function process_text .

#Show the Tokenization (a list of tokens )
df['text'].head().apply(process_text)

Convert the text into a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(df['text'])

Split the data into training & testing sets, and print them. We will use the testing set later on to make predictions and check whether they match the actual values.

The testing feature (independent) data set will be stored in X_test and the testing target (dependent) data set will be stored in y_test .

The training feature (independent) data set will be stored in X_train and the training target (dependent) data set will be stored in y_train .

#Split data into 80% training & 20% testing data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, df['spam'], test_size = 0.20, random_state = 0)
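The 80/20 split can be sanity-checked on toy data; train_test_split keeps the features and labels aligned row by row (made-up arrays, for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
y = np.arange(10) % 2              # toy binary labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0)
print(X_tr.shape, X_te.shape)      # (8, 2) (2, 2) -- an 80/20 split
```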

Get the shape of the data.

#Get the shape of messages_bow
messages_bow.shape

Create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features (e.g., word counts for text classification).

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
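Putting the vectorizer and classifier together, here is a minimal end-to-end sketch on a tiny made-up corpus (the texts and labels below are assumptions for illustration, not the real data set):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: 1 = spam, 0 = not spam (illustrative labels)
texts = ["win free cash now", "free prize claim now",
         "meeting at noon", "see you at lunch"]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(texts)  # word-count features
clf = MultinomialNB().fit(X, labels)        # learn word frequencies per class
print(clf.predict(X))                       # [1 1 0 0] on its own training data
```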

Print the classifier's predictions and the actual values on the training data set.

#Print the predictions
print(classifier.predict(X_train))
#Print the actual values
print(y_train.values)
top: predicted values, bottom: actual values

See how well the model performed by evaluating the Naive Bayes classifier on the training data and showing the classification report, confusion matrix & accuracy score.

#Evaluate the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_train)
print(classification_report(y_train, pred))
print('Confusion Matrix: \n', confusion_matrix(y_train, pred))
print()
print('Accuracy: ', accuracy_score(y_train, pred))
Metrics report followed by the confusion matrix and accuracy score
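For reference, here is how the accuracy score relates to the confusion matrix, on a tiny hand-made example (toy labels, not our model's actual output):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1]  # toy actual labels
y_pred = [0, 0, 1, 1, 0]  # toy predictions: one spam email missed
print(confusion_matrix(y_true, y_pred))
# [[2 0]
#  [1 2]]   rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))  # (2 + 2) / 5 = 0.8
```

Accuracy is the sum of the diagonal (correct predictions) divided by the total number of samples.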

It looks like the model / classifier is 99.71% accurate on the training data. Let's test the model / classifier on the test data set (X_test & y_test) by printing the predicted values and the actual values to see if the model can accurately classify the email text/message.

#Print the predictions
print('Predicted value: ',classifier.predict(X_test))
#Print Actual Label
print('Actual value: ',y_test.values)
Sample of the predicted/actual values.

Evaluate the model on the test data set

#Evaluate the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_test)
print(classification_report(y_test, pred))
print('Confusion Matrix: \n', confusion_matrix(y_test, pred))
print()
print('Accuracy: ', accuracy_score(y_test, pred))

The classifier accurately identified the email messages as spam or not spam with 99.2% accuracy on the test data!

Conclusion and Resources

That's it, you are done creating your email spam detection program!

Again, if you want, you can watch and listen to me explain all of the code in my YouTube video.

If you are interested in reading about machine learning to immediately get started with problems and examples, I recommend you read Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.

It is a great book for helping beginners learn to write machine learning programs and understand machine learning concepts.


Thanks for reading this article, I hope it’s helpful to you!

Other Resources:

(1) TFIDF Transformer
(2) Count Vectorizer
(3) Simple Spam Filter Naive Bayes
(4) Spam Ham Detection Using Naive Bayes
(5) Bag of Words
(6) Spam Detection With Logistic Regression
(7) Spam Detection
(8) Data Source

