Build Your Own Smart AI Chat Bot Using Python

A step-by-step guide to building an intelligent chat bot using Python.

In this article I will show you how to build your very own chat bot using the Python programming language and machine learning. More specifically, I want to create a “Doctor Chat Bot on Chronic Kidney Disease”: I can ask this chat bot about chronic kidney disease, and it will come up with a reasonable response.

What Is A Chat Bot?

A chat bot is software that conducts conversations. Many chat bots are created to simulate how a human would behave as a conversational partner. Chat bots are built into many devices and services, for example Siri, Cortana, Alexa, and Google Assistant. Many chat bots are used nowadays for customer service.

There are broadly two variants of chat bots: rule-based and self-learning.
A rule-based chat bot answers questions based on rules that it is given, while a self-learning chat bot uses some machine-learning-based technique to chat.

We will use a rule-based approach for responding to greetings. For questions and queries, the chat bot will take in some text and select the best response from that text. This type of self-learning is called retrieval-based learning.

What Is Chronic Kidney Disease?

Chronic kidney disease, also called chronic kidney failure, describes the gradual loss of kidney function. Your kidneys filter wastes and excess fluids from your blood, which are then excreted in your urine. When chronic kidney disease reaches an advanced stage, dangerous levels of fluid, electrolytes and wastes can build up in your body. -Mayo Clinic

In the early stages of chronic kidney disease, you may have few signs or symptoms. Chronic kidney disease may not become apparent until your kidney function is significantly impaired. -Mayo Clinic

Treatment for chronic kidney disease focuses on slowing the progression of the kidney damage, usually by controlling the underlying cause. Chronic kidney disease can progress to end-stage kidney failure, which is fatal without artificial filtering (dialysis) or a kidney transplant. -Mayo Clinic

Natural Language Processing Vocabulary

When it comes to Natural Language Processing (NLP), you will come across some terms that you may not be used to hearing. We will be using NLP throughout the code, and I will most certainly be using these terms in this article. A few of those terms and definitions are below:

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages. -Wikipedia

A corpus or text corpus is a large and structured set of texts of a particular author or a body of writing on a particular subject.

Bag of Words (BoW) is a Natural Language Processing technique for text modeling. It is called Bag of Words because any information about the order or structure of the words in the document is discarded, and the model is only concerned with the frequency of the known words in the document.

Consider the following documents:
“It was the best of times”
“It was the worst of times”
“It was the age of wisdom”
“It was the age of foolishness”

The dictionary contains the words:
{ ‘It’, ‘was’, ‘the’, ‘best’, ‘of’, ‘times’, ‘worst’, ‘age’, ‘wisdom’, ‘foolishness’}

If we want to vectorize the text “It was the best of times”, we would have the following vector: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0].

The frequencies of the 10 unique words in the dictionary are listed below for the text “It was the best of times”.

“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0

-An Introduction To Bag of Words
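The worked example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names build_vocabulary and vectorize are made up for illustration, and words are split on whitespace and lowercased, which is a simplifying assumption rather than what any particular library does.

```python
# Toy bag-of-words vectorizer (illustrative sketch, not a library API).
documents = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

def build_vocabulary(docs):
    """Collect unique lowercase words in order of first appearance."""
    vocabulary = []
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocabulary = build_vocabulary(documents)
vector = vectorize("It was the best of times", vocabulary)
print(vocabulary)
print(vector)
```

Because word order is thrown away, any rearrangement of the same words would produce the same vector.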

Stemming is the process of reducing inflected words to their word stem, base, or root form, generally a written word form. For example, if we were to stem the words “dance”, “dancing”, and “dances”, all three would be reduced to one shared stem, such as “danc”, which need not be a dictionary word.

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item; it is a variation of stemming. For example, “feet” and “foot” are both recognized as “foot”.

NOTE: Stemming and lemmatization both generate the root form of the inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language.
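To make the difference concrete, here is a deliberately crude sketch. The functions toy_stem and toy_lemmatize, the suffix list, and the small lookup table are all hypothetical stand-ins invented for this illustration; real stemmers and lemmatizers (such as those shipped with NLTK) are far more careful.

```python
# Crude suffix-stripping "stemmer" and lookup-table "lemmatizer".
# Both are hypothetical toys, far simpler than real stemmers and lemmatizers.
SUFFIXES = ("ing", "es", "e", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer needs vocabulary knowledge; a tiny hand-made table stands in for it.
TOY_LEMMAS = {"feet": "foot", "dancing": "dance", "dances": "dance"}

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

stems = [toy_stem(w) for w in ["dance", "dancing", "dances"]]
lemmas = [toy_lemmatize(w) for w in ["feet", "foot", "dances"]]
print(stems)   # every form collapses to the same non-word stem
print(lemmas)  # every result is a real word
```

The toy stemmer maps all three forms to the same string by chopping characters, while the toy lemmatizer can only return real words because it relies on a dictionary.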

One common task in NLP (Natural Language Processing) is tokenization. “Tokens” are usually individual words (at least in languages like English) and “tokenization” is taking a text or set of text and breaking it up into its individual words or sentences.
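As a rough illustration of tokenization, the sketch below splits text into sentences and words with regular expressions. It is an assumption-laden simplification (it would mishandle abbreviations such as “Dr.”), not the pre-trained tokenizer used later in this article, and the function names are made up here.

```python
import re

def split_sentences(text):
    """Naively split on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(sentence):
    """Return lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

sample = "Your kidneys filter wastes. They also remove excess fluids!"
sentences = split_sentences(sample)
words = split_words(sentences[0])
print(sentences)
print(words)
```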

CountVectorizer works on term frequency, i.e. it counts the occurrences of tokens and builds a sparse matrix of documents x tokens. Each entry is the number of times a term appears in a document/text.
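The idea of a sparse documents x tokens matrix can be sketched in plain Python by storing only the non-zero counts of each row. This is not how scikit-learn stores its matrix internally; the two sample documents and the dictionary-per-row layout are assumptions chosen to keep the example small.

```python
from collections import Counter

docs = ["the age of wisdom", "the age of the foolish"]

# Map each distinct token to a column index (sorted for a stable order).
columns = {token: i for i, token in enumerate(sorted({t for d in docs for t in d.split()}))}

# One sparse row per document: {column index: count}, zero counts omitted.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append({columns[token]: n for token, n in counts.items()})

print(columns)
print(rows)
```

Each row only mentions the tokens that actually occur in that document, which is what keeps large document collections manageable.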

TF-IDF stands for term frequency-inverse document frequency. TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

  • Term Frequency (TF): scores how frequently a term occurs in the current document. Since documents differ in length, a term may appear many more times in long documents than in short ones, so the term frequency is often divided by the document length to normalize:
    TF(t) = (number of times term t appears in the document) / (total number of terms in the document)
  • Inverse Document Frequency (IDF): scores how rare a term is across documents, which serves as a measure of how important the term is. The rarer the term, the higher the IDF score. A common definition is:
    IDF(t) = log(total number of documents / number of documents containing term t)

TF-IDF is thus the product of TF and IDF or TF * IDF.

-An Introduction To Bag of Words
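The TF * IDF product can be computed by hand on the four example documents used earlier. This sketch assumes the plain definitions (count divided by document length, and the natural log of total documents over documents containing the term); libraries often use smoothed variants, so their numbers may differ. The function names are illustrative only.

```python
import math

documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]
tokenized = [doc.split() for doc in documents]

def term_frequency(term, tokens):
    """Occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    """Product of the two scores above."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

score_common = tf_idf("was", tokenized[0], tokenized)  # "was" is in every document
score_rare = tf_idf("best", tokenized[0], tokenized)   # "best" is in one document
print(score_common, score_rare)
```

A word that appears in every document gets an IDF of log(1) = 0, so it contributes nothing, while a word unique to one document gets a positive score.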

Start Programming:

I will start by stating what I want this program to do. This program takes text from an online website and uses it to chat and answer queries. We are essentially going to create a ‘smart’ chat bot program to answer queries on chronic kidney disease.

# Description: This is a 'smart' chatbot program

We need to install a few packages: nltk and newspaper3k. NLTK is the Natural Language Toolkit, a popular package for NLP with Python. Newspaper3k is a Python package used for extracting and parsing newspaper articles. The code below also imports scikit-learn and NumPy, so install them too if they are not already available.

pip install nltk
pip install newspaper3k
pip install scikit-learn
pip install numpy

Import The Libraries & Packages

Next, import the libraries. We will use the newspaper library to extract the text from the website by using the Article class. We will use the random library to pick a random greeting response. We will use the string library to process standard Python strings. From the sklearn.feature_extraction.text library we will get the CountVectorizer class to turn the text into vectors of token counts. From the sklearn.metrics.pairwise library we will get the cosine_similarity method to see how similar the text is to the user's queries. We will also import the numpy library to use some of its methods, like the sort() method. Last but not least, we will use the warnings library to ignore the warnings we get with this program.

#Import the libraries
from newspaper import Article
import random
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
import numpy as np
import warnings
warnings.filterwarnings('ignore') #Ignore the warnings, as described above
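Since cosine similarity drives the whole response selection, it may help to see what it computes. The sketch below is a hand-written version for two plain lists, not the scikit-learn function imported above (which works on matrices); the name cosine_similarity_1d and the choice to return 0.0 for an all-zero vector are assumptions made here. The two vectors reuse the ten-word dictionary from the Bag of Words example.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

best = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # "It was the best of times"
worst = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]  # "It was the worst of times"
same = cosine_similarity_1d(best, best)
close = cosine_similarity_1d(best, worst)
print(same, close)
```

Identical texts score about 1.0, texts sharing most words score close to 1.0, and texts with no words in common score 0.0.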

Download the punkt package. Punkt is a pre-trained tokenizer model for the English language that divides the text into a list of sentences. If you’re unsure of which datasets/models you’ll need, you can install the popular collection, which is a subset of NLTK data, or all of the packages.

Note: the quiet option stops NLTK from outputting to the terminal when downloading.

nltk.download('punkt', quiet=True) # Download the punkt package

Download The Data From A Website

Read in the URL of the article to get the text corpus. Remember a text corpus or corpus is a large and structured set of texts of a particular author or a body of writing on a particular subject.

Download the article, then parse the article and apply NLP to the article. Then store the article's text in a variable named corpus.

#Get the article URL
article = Article('') #Pass the article URL as the string argument
article.download() #Download the article
article.parse() #Parse the article
article.nlp() #Apply Natural Language Processing (NLP)
corpus = article.text #Store the article text into corpus

Print the corpus.

#Print the corpus/text
print(corpus)

[Figure: a sample of the printed text/corpus]

Tokenize The Data

Next, we will tokenize the text by getting a list of sentences from the text.

text = corpus
sentence_list = nltk.sent_tokenize(text) #Convert the text to a list of sentences

Print the list of sentences.

#Print the list of sentences
print(sentence_list)

[Figure: a sample of the printed sentence tokens]

Next we will use keyword matching (a rule based approach) to check for greeting type words as input from the user and respond back with a randomized greeting as output.

To do this we need to create a list of greeting inputs (the greetings we expect from the user) and then we will create a list of greeting responses (the greetings that our chat bot will use).

Then we will create a function to check for the user's greeting and randomly choose a greeting response to send back.

#Function to return a random greeting response to a user's greeting
def greeting_response(text):
    #Convert the text to be all lowercase
    text = text.lower()
    # Keyword Matching
    #Greeting responses back to the user from the bot
    bot_greetings = ["howdy","hi", "hey", "what's good", "hello","hey there"]
    #Greeting input from the user
    user_greetings = ["hi", "hello", "hola", "greetings", "wassup","hey"]

    #If user's input is a greeting, return a randomly chosen greeting response
    for word in text.split():
        if word in user_greetings:
            return random.choice(bot_greetings)
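One limitation of splitting on whitespace is that an input such as “Hello!” produces the token “hello!”, which does not match any expected greeting. The variant below is a hypothetical tweak, not part of the original program: it uses the string library to strip punctuation before matching. The function name and the module-level constants are invented for this sketch.

```python
import random
import string

USER_GREETINGS = ["hi", "hello", "hola", "greetings", "wassup", "hey"]
BOT_GREETINGS = ["howdy", "hi", "hey", "what's good", "hello", "hey there"]

def greeting_response_stripped(text):
    """Hypothetical variant: remove punctuation before keyword matching."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    for word in cleaned.split():
        if word in USER_GREETINGS:
            return random.choice(BOT_GREETINGS)
    return None  # explicit: no greeting found

reply = greeting_response_stripped("Hello, doc!")
no_reply = greeting_response_stripped("What causes kidney disease?")
print(reply, no_reply)
```

Returning None explicitly when nothing matches makes the "not a greeting" case easier to spot for whoever calls the function.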

Create a function that returns the indices of an array's values, ordered from the largest value to the smallest. This function will help select the chat bot response, because the most similar sentences come first.

#Return the indices of the values from an array in sorted order by the values
def index_sort(list_var):
    length = len(list_var)
    list_index = list(range(0, length))
    x = list_var
    for i in range(length):
        for j in range(length):
            if x[list_index[i]] > x[list_index[j]]:
                temp = list_index[i]
                list_index[i] = list_index[j]
                list_index[j] = temp
    return list_index
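For comparison, the same largest-first ordering of indices can be obtained with Python's built-in sorted function. This is an alternative sketch rather than the function used in this article; the name index_sort_builtin is made up, and when several values are equal the two approaches may order the tied indices differently.

```python
def index_sort_builtin(values):
    """Indices of `values`, ordered from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

scores = [0.1, 0.9, 0.5]
order = index_sort_builtin(scores)  # index 1 holds the largest score
print(order)
```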

Generating The Chat Bot Response

We are going to create a function that takes in a user's response or query and sends back the best response(s) selected from the corpus.

# Generate the response
def bot_response(user_input):
    user_input = user_input.lower() #Convert the user's input to all lowercase letters
    sentence_list.append(user_input) #Append the user's response to the list of sentence tokens
    bot_response = '' #Create an empty response for the bot
    cm = CountVectorizer().fit_transform(sentence_list) #Create the count matrix
    similarity_scores = cosine_similarity(cm[-1], cm) #Get the similarity scores to the user's input
    flatten = similarity_scores.flatten() #Reduce the dimensionality of the similarity scores
    index = index_sort(flatten) #Sort the indices from the highest score to the lowest
    index = index[1:] #Drop the first index (the query itself, which has the highest score)
    response_flag = 0 #Set a flag letting us know if the text contains a similarity score greater than 0.0
    #Loop through the index list and use up to three sentences as the response
    j = 0
    for i in range(0, len(index)):
        if flatten[index[i]] > 0.0:
            bot_response = bot_response+' '+sentence_list[index[i]]
            response_flag = 1
            j = j+1
        if j > 2:
            break
    #If no sentence contains a similarity score greater than 0.0 then respond with 'I apologize, I don't understand.'
    if response_flag == 0:
        bot_response = bot_response+' '+"I apologize, I don't understand."
    sentence_list.remove(user_input) #Remove the user's response from the sentence tokens

    return bot_response
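The selection step at the end of the function can be isolated into a small sketch that takes ready-made similarity scores. The function pick_top_sentences, its limit parameter, and the sample sentences and scores are all made up for illustration; unlike the function above, this version joins the sentences without a leading space.

```python
def pick_top_sentences(scores, sentences, limit=3):
    """Join up to `limit` sentences with a positive score, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = [sentences[i] for i in ranked if scores[i] > 0.0][:limit]
    if not chosen:
        return "I apologize, I don't understand."
    return " ".join(chosen)

sentences = ["Kidneys filter wastes.", "Dialysis is a treatment.", "Symptoms may be few."]
answer = pick_top_sentences([0.0, 0.7, 0.2], sentences)
fallback = pick_top_sentences([0.0, 0.0, 0.0], sentences)
print(answer)
print(fallback)
```

Sentences with a score of 0.0 share no words with the query, so they are never returned; if every score is 0.0 the bot falls back to the apology.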

Start The Conversation

We can now create a continuous loop for the chat bot to converse with the user. We will run this loop until the user's response is one of the exit words, such as ‘exit’ or ‘bye’.

#Start the chat
print("Doc Bot: I am DOCTOR BOT or Doc Bot for short. I will answer your queries about Chronic Kidney Disease. If you want to exit, type Bye!")
exit_list = ['exit', 'see you later','bye', 'quit', 'break']
while(True):
    user_input = input()
    if(user_input.lower() in exit_list):
        print("Doc Bot: Chat with you later !")
        break
    else:
        if(greeting_response(user_input) != None):
            print("Doc Bot: "+greeting_response(user_input))
        else:
            print("Doc Bot: "+bot_response(user_input))
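Because the loop reads from input(), it is awkward to try out piece by piece. One possible restructuring, offered only as a sketch, moves the decision for a single turn into a function that returns the reply and whether to keep going. The name chat_turn and the stand-in functions fake_greet and fake_answer are hypothetical; in the real program greeting_response and bot_response would be passed in instead.

```python
EXIT_WORDS = ["exit", "see you later", "bye", "quit", "break"]

def chat_turn(user_input, greet, answer):
    """Return (reply, keep_going) for one line of user input.

    `greet` and `answer` are callables standing in for greeting_response
    and bot_response; passing them in keeps this sketch self-contained.
    """
    if user_input.lower() in EXIT_WORDS:
        return "Doc Bot: Chat with you later !", False
    greeting = greet(user_input)
    if greeting is not None:
        return "Doc Bot: " + greeting, True
    return "Doc Bot: " + answer(user_input), True

# Stand-in functions, used only to exercise the control flow.
def fake_greet(text):
    return "hello" if "hi" in text.lower().split() else None

def fake_answer(text):
    return "Kidneys filter wastes."

first = chat_turn("Hi", fake_greet, fake_answer)
second = chat_turn("What do kidneys do?", fake_greet, fake_answer)
last = chat_turn("Bye", fake_greet, fake_answer)
print(first, second, last)
```

Calling the greeting function once and reusing its result also avoids picking two different random greetings for the same input.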

If you are interested in reading more on machine learning and want to get started right away with problems and examples, I strongly recommend Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. It is a great book for helping beginners learn how to write machine learning programs and understand machine learning concepts.

Thanks for reading this article. I hope it’s helpful to you all! If you enjoyed it and found it helpful, please leave some claps to show your appreciation. Keep up the learning, and if you like machine learning, mathematics, computer science, programming, or algorithm analysis, please visit and subscribe to my YouTube channels (randerson112358 & compsci112358).

[3] Cosine Similarity
[4] Build Your Own ChatBot Using Python
[5] Build A Movie Recommendation Engine
[6] An Introduction To Bag of Words
[7] Mayo Clinic
