Heart Disease Detection Using Machine Learning & Python

The term “heart disease” is often used interchangeably with the term “cardiovascular disease.” Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered forms of heart disease.
Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease; heart rhythm problems (arrhythmias); and heart defects you’re born with (congenital heart defects), among others. Many forms of heart disease can be prevented or treated with healthy lifestyle choices. — Mayo Clinic
In this article I will show you how to create a program in Python to detect if a person has a cardiovascular disease or not. Information about the data set that I will be using throughout this article and program can be found below.
Data Set Features:
Feature | Variable | Value type
Age | age | int (days)
Height | height | int (cm)
Weight | weight | float (kg)
Gender | gender | categorical code
Systolic blood pressure | ap_hi | int
Diastolic blood pressure | ap_lo | int
Cholesterol | cholesterol | 1: normal, 2: above normal, 3: well above normal
Glucose | gluc | 1: normal, 2: above normal, 3: well above normal
Smoking | smoke | binary
Alcohol intake | alco | binary
Physical activity | active | binary
Presence or absence of cardiovascular disease | cardio | binary
If you prefer not to read this article and would like a video representation of it, you can check out the YouTube video. It goes through everything in this article in a little more detail and will help you start programming your own machine learning model, even if you don’t have Python installed on your computer. You can also use both as supplementary materials for learning about machine learning!
Programming:
The first thing that I like to do before writing a single line of code is to put in a description in comments of what the code does. This way I can look back on my code and know exactly what it does.
#Description:
#This program classifies a person as having a cardiovascular disease (1) or not (0)
#So the target class "cardio" equals 1, when the patient has cardiovascular disease, and it's 0, when the patient is healthy.
Import the libraries.
#Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
Load the data set.
#Load the data
from google.colab import files # Use to load data on Google Colab
uploaded = files.upload() # Use to load data on Google Colab
Store the data into a variable and print the data.
#Store the data into the df variable
df = pd.read_csv('cardio.csv')
df.head(7) #Print the first 7 rows

Get the number of rows and columns.
#Get the shape of the data (the number of rows & columns)
df.shape

Count the number of empty values in each column.
#Count the empty (NaN, NAN, na) values in each column
df.isna().sum()

Here is another way to check whether your data set contains any null values.
#Another check for any null / missing values
df.isnull().values.any()
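To make the two checks above concrete, here is a minimal sketch on a tiny made-up DataFrame (the `toy` data is hypothetical and is not the article's data set). It also shows `dropna()`, one common way to handle missing rows, which is my own addition and not a step the program above needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the article's data set) with two missing values
toy = pd.DataFrame({
    "height": [168, 156, np.nan, 169],
    "weight": [62.0, 85.0, 64.0, np.nan],
    "cardio": [0, 1, 1, 0],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
has_missing = toy.isna().values.any()   # True if any cell in the frame is missing
complete_rows = toy.dropna()            # keep only rows without any NaN

print(missing_per_column)
print(has_missing)
print(complete_rows.shape)
```

If `has_missing` comes back `False` on your real data, no cleanup is needed before training.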
Get some statistics on the data.
#View some basic statistical details like percentile, mean, standard deviation etc.
df.describe()

Get a count of the number of individuals with a cardiovascular disease and the number of individuals without a cardiovascular disease.
#Get a count of the number of patients with (1) and without (0) a cardiovascular disease
df['cardio'].value_counts()
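The raw counts are easier to interpret as proportions, because the class balance tells you what accuracy a model would get by always guessing the majority class. Here is a small sketch using made-up labels (the `labels` Series is hypothetical and stands in for `df['cardio']`).

```python
import pandas as pd

# Hypothetical labels standing in for df['cardio']
labels = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="cardio")

counts = labels.value_counts()                     # absolute count per class
proportions = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(proportions)
```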

Visualize the number of individuals with a cardiovascular disease and the number of individuals without a cardiovascular disease.
#Visualize this count
sns.countplot(x=df['cardio']) #Pass the column as x; newer seaborn versions require the keyword

Let’s look at the ages at which the number of people with a cardiovascular disease exceeds the number of people without one.
#Look at the ages at which the number of people with a cardiovascular disease
#exceeds the number of people without one
#Create a years column
df['years'] = (df['age'] / 365).round(0) #Get the years by dividing the age in days by 365
df['years'] = pd.to_numeric(df['years'], downcast='integer') #Convert years to an integer
#Visualize the data
#colorblind palette for colorblindness
sns.countplot(x='years', hue='cardio', data = df, palette="colorblind", edgecolor=sns.color_palette("dark", n_colors = 1));
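The same comparison can be read off as numbers instead of a plot. The sketch below is only an illustration on made-up rows (the `toy` frame and the names `counts_by_age` and `ages_more_cardio` are hypothetical). It converts age in days to whole years and then lists the ages where people with the disease outnumber people without it.

```python
import pandas as pd

# Hypothetical toy rows; 'age' is in days, as in the data set description
toy = pd.DataFrame({
    "age": [18250, 18300, 20075, 20100, 20080],
    "cardio": [0, 1, 1, 1, 0],
})

# Days -> rounded whole years (similar to the conversion used in the article)
toy["years"] = (toy["age"] / 365).round().astype(int)

# Rows = age in years, columns = cardio class (0 or 1), values = number of people
counts_by_age = pd.crosstab(toy["years"], toy["cardio"])

# Ages where people with the disease outnumber people without it
more_cardio = counts_by_age[1] > counts_by_age[0]
ages_more_cardio = counts_by_age.index[more_cardio].tolist()

print(counts_by_age)
print(ages_more_cardio)
```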

Get the correlation of the columns.
#Get the correlation of the columns
df.corr()

Visualize the correlation.
#Visualize the correlation
import matplotlib.pyplot as plt
plt.figure(figsize=(7,7)) #7in by 7in
sns.heatmap(df.corr(), annot=True, fmt='.0%')
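A full heatmap can be hard to scan, so it sometimes helps to pull out just the column of correlations with the target and sort it. This is a rough sketch on synthetic data (the `toy` frame is made up: `ap_hi` is deliberately built to track the label and `height` is pure noise), so the numbers say nothing about the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: 'ap_hi' tracks the label, 'height' is noise
rng = np.random.default_rng(0)
cardio = rng.integers(0, 2, size=500)
toy = pd.DataFrame({
    "ap_hi": 120 + 20 * cardio + rng.normal(0, 10, size=500),
    "height": rng.normal(165, 8, size=500),
    "cardio": cardio,
})

# Correlation of every feature with the target, strongest (by absolute value) first
corr_with_target = (
    toy.corr()["cardio"]
    .drop("cardio")
    .sort_values(key=np.abs, ascending=False)
)

print(corr_with_target)
```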

Remove or drop the years column and the id column.
# Remove or drop the years column
df = df.drop('years', axis=1)
#Remove or drop the id column
df = df.drop('id', axis=1)
Split the data into feature data and target data.
#Split the data into feature data and target data
X = df.iloc[:, :-1].values
Y = df.iloc[:, -1].values
Split the data again, into 75% training data set and 25% testing data set.
#Split the data again, into 75% training data set and 25% testing data set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size= 0.25, random_state = 1)
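One optional variation, which the program above does not use, is a stratified split. Passing `stratify` to `train_test_split` keeps the ratio of the two classes roughly the same in the training and test sets. Here is a sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 100 rows, 3 features, 30% positive labels
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the class ratio about the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of positive labels in each split
```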
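Standardize the feature values so that each column has a mean of 0 and a standard deviation of 1. Fit the scaler on the training data only, then apply the same transformation to the test data.
#Feature Scaling
#Standardize each feature to a mean of 0 and a standard deviation of 1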
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
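To see what `StandardScaler` actually does to the numbers, here is a small sketch on made-up values (`X_toy` is hypothetical). It compares `StandardScaler`, which centers each column at 0 with a standard deviation of 1, against `MinMaxScaler`, which is the scikit-learn class that maps each column into the range 0 to 1. The comparison is my own illustration and not part of the program above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy features: height in cm and weight in kg
X_toy = np.array([[150.0, 50.0], [160.0, 60.0], [170.0, 70.0], [180.0, 100.0]])

standardized = StandardScaler().fit_transform(X_toy)  # mean 0, std 1 per column
min_maxed = MinMaxScaler().fit_transform(X_toy)       # range [0, 1] per column

print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_maxed.min(axis=0), min_maxed.max(axis=0))
```

Standardized values can be negative or greater than 1, which is expected.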
Create the machine learning model called a Random Forest Classifier.
# Use Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 1)
forest.fit(X_train, Y_train)
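Once a forest is fitted, its `feature_importances_` attribute gives a rough idea of which columns the trees relied on. The sketch below uses synthetic data where only the first feature determines the label (`X_toy`, `y_toy`, and `toy_forest` are hypothetical names), so it shows the mechanics only and says nothing about the real data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data: only the first feature determines the label
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Same settings as the forest in the article
toy_forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=1)
toy_forest.fit(X_toy, y_toy)

# One importance value per feature; the values sum to 1
importances = toy_forest.feature_importances_
print(importances)
```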
Test the model’s accuracy on the training data.
#Test the model's accuracy on the training data set
model = forest
model.score(X_train, Y_train)

The model was about 97.99% accurate on the training data.
Test the model’s accuracy on the test data set by creating a confusion matrix and then using the confusion matrix to compute the accuracy score.
Print the confusion matrix and the accuracy to the screen.
#Test the model's accuracy on the test data set
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, model.predict(X_test))
TN = cm[0][0] #True negatives
TP = cm[1][1] #True positives
FN = cm[1][0] #False negatives
FP = cm[0][1] #False positives
#Print the confusion matrix
print(cm)
#Print the model's accuracy on the test data
print('Model Test Accuracy = {}'.format((TP + TN) / (TP + TN + FN + FP)))
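The same four cells of the confusion matrix also give precision and recall, which matter in a medical setting because a missed case (false negative) and a false alarm (false positive) are not equally costly. Here is a minimal sketch with made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not the article's results), cross-checked against scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (not the article's results)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many were found

print(accuracy, precision, recall)
print(accuracy_score(y_true, y_pred))       # should match the manual accuracy
```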

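The model was 70.2% accurate on the test data. This is okay, but when it comes to individuals and their health, you would want a much higher accuracy score than that. The large gap between the training accuracy (about 97.99%) and the test accuracy also suggests that the model is overfitting the training data.
With some more tweaking of this program, it may be possible to get a higher accuracy score!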
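One possible way to do that tweaking is a small cross-validated grid search over the forest's settings. The sketch below is only an illustration under my own assumptions: it runs on synthetic data from `make_classification` instead of the real training set, and the values in `param_grid` are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for X_train / Y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=11, random_state=1)

# Hypothetical search grid; the values are illustrative only
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=1),
    param_grid,
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)  # best combination found on the synthetic data
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

Limiting `max_depth` is one common way to reduce overfitting in tree-based models, which is why it is included in the grid here.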
If you are interested in reading more about machine learning and want to get started right away with problems and examples, I strongly recommend Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. It is a great book for helping beginners learn how to write machine learning programs and understand machine learning concepts.

Thanks for reading this article. I hope it’s helpful to you all! If you enjoyed this article and found it helpful, please leave some claps to show your appreciation. Keep up the learning, and if you like machine learning, mathematics, computer science, programming, or algorithm analysis, please visit and subscribe to my YouTube channels (randerson112358 & compsci112358).
