Python Decision Tree Classifier Example

Make Golf Predictions Using A Decision Tree


In this article I will use the Python programming language and a machine learning algorithm called a decision tree to predict whether a player will play golf that day based on the weather (Outlook, Temperature, Humidity, Windy).

Decision Trees are a type of Supervised Learning Algorithm (meaning that they are given labeled data to train on). The training data is continuously split into two or more sub-nodes according to a certain parameter. The tree can be explained by two things: leaves and decision nodes. The decision nodes are where the data is split. The leaves are the decisions or the final outcomes. You can think of a decision tree in programming terms as a tree made up of a bunch of “if statements” at each node until you get to a leaf node (the final outcome).
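
As a rough illustration of that “tree of if statements” idea, here is a hand-written sketch using the golf data described later in this article. It is only an illustration of the concept, not the tree the scikit-learn model below actually learns.

# A hand-written "decision tree" made of if statements (illustration only).
# Each if/elif is a decision node; each return value is a leaf node.
def will_play_golf(outlook, humidity, windy):
    if outlook == 'overcast':                            # decision node
        return 'yes'                                     # leaf
    elif outlook == 'sunny':                             # decision node
        return 'no' if humidity == 'high' else 'yes'     # leaves
    else:  # rainy                                       # decision node
        return 'no' if windy == 'true' else 'yes'        # leaves

print(will_play_golf('sunny', 'normal', 'false'))  # -> 'yes'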

Decision Tree Pros:

  1. Simple to understand and to interpret
  2. Requires little data preparation

Decision Tree Cons:

  1. Prone to over-fitting (see the sketch after this list for one common mitigation)
  2. Decision trees can be unstable (a small variation in the data may result in a completely different tree being generated)
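
One common way to rein in over-fitting is to cap how deep the tree can grow. Here is a minimal sketch, assuming the same one_hot_data features and golf_df['Play'] labels built later in this article; the specific max_depth and min_samples_leaf values are just illustrative choices, not part of the original program.

# Limiting tree depth and requiring a minimum number of samples per leaf
# are two standard ways to reduce over-fitting in decision trees.
from sklearn import tree

pruned_clf = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=2)
# pruned_clf.fit(one_hot_data, golf_df['Play'])  # fit it exactly like the full tree below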

If you prefer not to read this article and would like a video version of it, you can check out the video below. It goes through everything in this article in a little more detail, and will help you start programming your own Decision Tree machine learning model even if you don’t have Python installed on your computer. Or you can use both as supplementary materials for learning about Decision Trees!

A Python Decision Tree Example Video

Start Programming

The first thing to do is to install the dependencies, the libraries that will make this program easier to write. I will import the machine learning library sklearn, along with pandas, pydotplus and IPython.display.

## import dependencies
from sklearn import tree #For our Decision Tree
import pandas as pd # For our DataFrame
import pydotplus # To create our Decision Tree Graph
from IPython.display import Image # To display an image of our graph

Next I will create the data set that will be used for this example by first creating an empty pandas DataFrame and then adding data for every column/feature/attribute (Outlook, Temperature, Humidity, Windy, Play).

Data Description:

Outlook = The outlook of the weather
Temperature = The temperature of the weather
Humidity = The humidity of the weather
Windy = A variable if it is windy that day or not
Play = The target variable, tells if the golfer played golf that day or not

Outlook values: sunny, overcast, rainy
Temperature values: hot, mild, cool
Humidity values: high, normal
Windy values: true, false
Play values: yes, no

#Create the dataset
#create empty data frame
golf_df = pd.DataFrame()

#add outlook
golf_df['Outlook'] = ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy',
'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast',
'overcast', 'rainy']

#add temperature
golf_df['Temperature'] = ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool',
'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild']

#add humidity
golf_df['Humidity'] = ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal',
'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high']

#add windy
golf_df['Windy'] = ['false', 'true', 'false', 'false', 'false', 'true', 'true',
'false', 'false', 'false', 'true', 'true', 'false', 'true']

#finally add play
golf_df['Play'] = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes',
'yes', 'yes', 'no']


#Print/show the new data
print(golf_df)
The Data Set Created In golf_df

Now I will convert the categorical variables (Outlook, Temperature, Humidity, Windy, Play) into dummy/indicator variables or (binary variables) essentially 1’s and 0's.

# Convert the categorical variables into dummy/indicator variables (binary variables), essentially 1's and 0's.
# I chose the variable name one_hot_data because in ML, "one-hot" refers to a group of bits among which
# the only legal combinations of values are those with a single high (1) bit and all the others low (0).
one_hot_data = pd.get_dummies(golf_df[['Outlook', 'Temperature', 'Humidity', 'Windy']])

# Print the new dummy data
one_hot_data
Partial one_hot_data set up to column Temperature_mild

Finally I get to the point of creating and training the Decision Tree Classifier (the model)!

# The decision tree classifier.
clf = tree.DecisionTreeClassifier()
# Training the Decision Tree
clf_train = clf.fit(one_hot_data, golf_df['Play'])

Next I will graph the Decision Tree to get a better visual of what the model is doing, by printing the DOT data of the tree, building a graph from the DOT data using the pydotplus graph_from_dot_data method, and displaying the graph using the IPython.display Image method.

# Export/Print a decision tree in DOT format.
print(tree.export_graphviz(clf_train, None))

#Create Dot Data
dot_data = tree.export_graphviz(clf_train, out_file=None,
                                feature_names=list(one_hot_data.columns.values),
                                class_names=['Not_Play', 'Play'], rounded=True, filled=True)
# Gini decides which attribute/feature should be placed at the root node,
# and which features will act as internal nodes or leaf nodes
#Create Graph from DOT data
graph = pydotplus.graph_from_dot_data(dot_data)

# Show graph
Image(graph.create_png())
The Graph Of The Decision Tree
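
The gini value shown in each node of the graph is the Gini impurity, which DecisionTreeClassifier uses by default to pick its splits. For the root node, which holds 9 ‘yes’ and 5 ‘no’ examples, it works out to about 0.459. A quick sketch of that calculation on our data:

# Gini impurity of the root node: 1 - sum(p_i^2) over the class proportions.
counts = golf_df['Play'].value_counts()      # 9 'yes', 5 'no'
proportions = counts / counts.sum()
gini = 1 - (proportions ** 2).sum()
print(gini)                                  # ~0.459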

Last but not least, make the prediction by inputting the Outlook as ‘sunny’, Temperature as ‘hot’, Humidity as ‘normal’ and Windy as ‘false’. My model predicted that input to be ‘yes’, meaning the golfer will play golf that day. The program is done! Of course this was just a simple example of a Decision Tree Classifier. In reality you would most likely upload a data set instead of creating one, then clean and explore the data, split your data into a training set and a testing set, and test your model's accuracy using some statistical metric (see the sketch after the prediction code below).

# Test model prediction input:
# Outlook = sunny, Temperature = hot, Humidity = normal, Windy = false
prediction = clf_train.predict([[0,0,1,0,1,0,0,1,1,0]])
prediction
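
The row of 1's and 0's above has to follow the column order of one_hot_data, which is easy to get wrong by hand. As a sketch of a less error-prone alternative, and of the train/test workflow mentioned above, something like the following would work; the 30% test size and the accuracy metric are just illustrative choices, and with only 14 rows this data set is really too small for a meaningful split.

# Build the prediction row from the raw values instead of hand-typing 1's and 0's.
sample = pd.DataFrame([{'Outlook': 'sunny', 'Temperature': 'hot',
                        'Humidity': 'normal', 'Windy': 'false'}])
sample_encoded = pd.get_dummies(sample).reindex(columns=one_hot_data.columns, fill_value=0)
print(clf_train.predict(sample_encoded))

# Split the data, train on the training set only, and score on the held-out test set.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    one_hot_data, golf_df['Play'], test_size=0.3, random_state=42)

test_clf = tree.DecisionTreeClassifier()
test_clf.fit(X_train, y_train)
print(accuracy_score(y_test, test_clf.predict(X_test)))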

You can watch the video above to see how I coded this program and code along with me with a few more detailed explanations, or you can just click the YouTube link here.

If you are also interested in reading more on machine learning so you can immediately get started with problems and examples, then I strongly recommend you check out Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. It is a great book for helping beginners learn how to write machine learning programs and understand machine learning concepts.

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Thanks for reading this article, I hope it's helpful to you all! If you enjoyed this article and found it helpful, please leave some claps to show your appreciation. Keep up the learning, and if you like machine learning, mathematics, computer science, programming or algorithm analysis, please visit and subscribe to my YouTube channels (randerson112358 & compsci112358).


Resources:

https://chrisalbon.com/machine_learning/trees_and_forests/visualize_a_decision_tree/
