Stock Market Sentiment Analysis Using Python & Machine Learning
Predict whether a company's stock will increase or decrease based on news headlines using sentiment analysis
In this article, I will attempt to determine whether the price of a stock will increase or decrease based on the sentiment of the top news headlines for the current day, using Python and machine learning.
The idea is to either create or find a data set that contains news headlines about a particular stock or company, gather the stock prices for the days those articles came out, and then perform sentiment analysis and machine learning on the data to determine whether the price of the stock will increase or decrease.
What Is Sentiment Analysis?
“Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, subject etc. is positive, negative, or neutral.” — Wikipedia
“Sentiment analysis is the measurement of neutral, negative, and positive language. It is a way to evaluate spoken or written language to determine if the expression is favorable (positive), unfavorable (negative), or neutral, and to what degree.” — Clarabridge
“Sentiment analysis: the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.” — Oxford English Dictionary
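Before reaching for real libraries, the idea behind these definitions can be illustrated with a toy lexicon-based scorer. The mini-lexicon below is made up purely for illustration; real tools such as VADER use thousands of weighted entries plus rules for negation and intensity.

```python
# A toy lexicon-based sentiment scorer -- illustration only.
# This mini-lexicon is hypothetical; real tools (e.g. VADER)
# use much larger, weighted lexicons.
LEXICON = {"gain": 1, "rise": 1, "good": 1, "strong": 1,
           "drop": -1, "fall": -1, "crash": -1, "weak": -1}

def toy_sentiment(text):
    # Sum the scores of known words, then bucket the total
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(toy_sentiment("Stocks rise on good earnings"))      # -> positive
print(toy_sentiment("Markets crash as oil prices fall"))  # -> negative
```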
I decided to use a data set that contains top news articles on the Dow Jones Industrial Average (DJIA), and a label column that shows when the DJIA adjusted close price rose or stayed the same (represented by the number ‘1’) and when it decreased (represented by the number ‘0’).
If you prefer not to read this post and would like a video version of it, you can check out the YouTube video below. It goes through everything in this article in a little more detail and will make it easy for you to start programming, even if you don’t have Python installed on your computer. Or you can use both as supplementary materials for learning!
I will first start the program with a comment describing what it does.
#This program determines if the price of a stock will increase or decrease based on news sentiment
Install the dependencies.
pip install vaderSentiment
pip install textblob
Import the libraries that will be needed throughout the program.
# Import the libraries
import re
import pandas as pd
import numpy as np
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report
Load the data sets (in Google Colab, upload the CSV files when prompted).
# Load the data
from google.colab import files
uploaded = files.upload()
Store the data into variables.
#Store the data into variables
df1 = pd.read_csv('Dow_Jones_Industrial_Average_News.csv')
df2 = pd.read_csv('Dow_Jones_Industrial_Average_Stock.csv')
Show the first 3 rows of data for df1.
#Show the first 3 rows of data for the news (df1)
df1.head(3)
Show the first 3 rows of data for df2.
#Show the first 3 rows of data for the stock (df2)
df2.head(3)
Next, merge the two data sets on the date column and show the first 3 rows. (Note: pandas does not allow combining on='Date' with left_index=True, so the merge is done on the 'Date' column alone.)
#Merge the data sets on the date field
merge = df1.merge(df2, how='inner', on='Date')

#Show the merged data set
merge.head(3)
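To illustrate what the inner merge does, here are two hypothetical miniature versions of the data sets (the values are made up); only dates present in both frames survive the merge.

```python
import pandas as pd

# Hypothetical miniature versions of the news and stock data sets
news = pd.DataFrame({'Date': ['2008-08-08', '2008-08-11', '2008-08-12'],
                     'Top1': ['headline a', 'headline b', 'headline c']})
stock = pd.DataFrame({'Date': ['2008-08-08', '2008-08-11'],
                      'Close': [11734.32, 11782.35]})

# An inner merge on 'Date' keeps only rows whose date appears in both frames
merged = news.merge(stock, how='inner', on='Date')
print(merged.shape)  # -> (2, 3): '2008-08-12' has no stock row, so it is dropped
```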
Combine the top news headlines for each day into a single string.
#Combine the top news headlines
headlines = []

for row in range(0, len(merge.index)):
    headlines.append(' '.join(str(x) for x in merge.iloc[row, 2:27]))
Clean the headline data by removing the leftover b' and b" byte-string markers and stray apostrophes.
#Clean the data
clean_headlines = []

for i in range(0, len(headlines)):
    clean_headlines.append(re.sub("b[(')]+", '', headlines[i]))      # remove b'
    clean_headlines[i] = re.sub('b[(")]+', '', clean_headlines[i])   # remove b"
    clean_headlines[i] = re.sub("\'", '', clean_headlines[i])        # remove \'
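To see what the three substitutions do, here is a single hypothetical headline run through the same patterns (the raw text still carries the b' marker left over from a Python byte string):

```python
import re

# Hypothetical raw headline, still wrapped in a byte-string marker
raw = "b'Georgia downs two Russian warplanes as countries move to brink of war'"

cleaned = re.sub("b[(')]+", '', raw)      # strip the leading b' marker
cleaned = re.sub('b[(")]+', '', cleaned)  # strip any b" marker
cleaned = re.sub("\'", '', cleaned)       # strip remaining apostrophes
print(cleaned)  # -> Georgia downs two Russian warplanes as countries move to brink of war
```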
Add the cleaned headlines to the data set.
#Add the clean headlines to the data set
merge['Combined_News'] = clean_headlines
Next, create two functions: one to get the subjectivity of the headlines and the other to get the polarity.
The subjectivity shows how subjective or objective a statement is, on a scale from 0 (objective) to 1 (subjective).
The polarity shows how positive or negative the statement is, on a scale from -1 to 1: a value near 1 means the statement is positive, a value near 0 means it is neutral, and a value near -1 means it is negative.
# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity
Create two new columns for the subjectivity and polarity.
# Create two new columns 'Subjectivity' & 'Polarity'
merge['Subjectivity'] = merge['Combined_News'].apply(getSubjectivity)
merge['Polarity'] = merge['Combined_News'].apply(getPolarity)
Next, create a function to get the sentiment scores (neg, pos, neu, & compound). The compound score is a metric that calculates the sum of all the lexicon ratings, normalized between -1 (most extreme negative) and +1 (most extreme positive).
Pos is the positive percentage score, neg is the negative percentage score, and neu is the neutral percentage score.
The percentages always add up: %pos + %neg + %neu = 100%.
#Create a function to get the sentiment scores (using SentimentIntensityAnalyzer)
def getSIA(text):
    sia = SentimentIntensityAnalyzer()
    sentiment = sia.polarity_scores(text)
    return sentiment
Get the sentiment scores for each day.
#Get the sentiment scores for each day
compound = []
neg = []
neu = []
pos = []

for i in range(0, len(merge['Combined_News'])):
    SIA = getSIA(merge['Combined_News'][i])
    compound.append(SIA['compound'])
    neg.append(SIA['neg'])
    neu.append(SIA['neu'])
    pos.append(SIA['pos'])
Store the sentiment scores in the data set.
#Store the sentiment scores in the data frame
merge['Compound'] = compound
merge['Negative'] = neg
merge['Neutral'] = neu
merge['Positive'] = pos
Create a list of columns to keep in the completed data set and show the data.
#Create a list of columns to keep
keep_columns = ['Open', 'High', 'Low', 'Volume', 'Subjectivity', 'Polarity', 'Compound', 'Negative', 'Neutral', 'Positive', 'Label']

df = merge[keep_columns]
df.head(3)
Create the feature and target data sets.
#Create the feature data set
X = np.array(df.drop(columns=['Label']))

#Create the target data set
y = np.array(df['Label'])
Split the data into 80% training and 20% testing data sets.
#Split the data into 80% training and 20% testing data sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)
Create and train the model.
# Create and train the model
model = LinearDiscriminantAnalysis().fit(x_train, y_train)
Get and show the model's predictions.
#Get the model's predictions/classification
predictions = model.predict(x_test)
print(predictions)
Show the model's metrics.
#Show the model's metrics
print( classification_report(y_test, predictions) )
It looks like this model is about 84% accurate, which isn't bad! A lot more testing and some fine-tuning are still needed, but this is a great start. I hope you enjoyed the article!
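One hedged way to do the extra testing mentioned above is k-fold cross-validation, which averages accuracy over several train/test splits instead of trusting a single one. The sketch below uses synthetic stand-in data (the real X and y come from the merged data frame above):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature/label arrays
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validation yields five accuracy estimates instead of one
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
print(scores.mean())
```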
If you are interested in reading more on machine learning to immediately get started with problems and examples, then I strongly recommend you check out Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. It is a great book for helping beginners learn to write machine learning programs and understand machine learning concepts.
Thanks for reading this article, I hope it's helpful to you all! If you enjoyed this article and found it helpful, please leave some claps to show your appreciation. Keep up the learning, and if you like machine learning, mathematics, computer science, programming, or algorithm analysis, please visit and subscribe to my YouTube channels (randerson112358 & computer science).