Scrape A Political Website To Build A Fake & Real News Data Set Using Python

Build a fake/real news data set!

In this article, I will show you how to create your very own data set containing false/fake and true/real news by scraping a political fact-checking website called PolitiFact.com using Python! You can later use this data set for other projects, like building models to detect fake news.

Fake news is untrue information presented as news. Many times it is created and spread to damage the reputation of a person or entity, or to make money. It would be nice to be able to detect fake news using a simple model, but before that can happen, I need data to train my model on, hence this article. So without further ado, let's understand what we need to do before we start writing code for the program.

Understanding the Steps and Concepts Before Writing the Program

1. First, we need to identify a website to scrape. Luckily, I already chose one: PolitiFact.com, a nonprofit project operated by the Poynter Institute in St. Petersburg, Florida, with offices there and in Washington, D.C.
Image showing the website to scrape data from

2. Identify the information that you want to scrape. I want the statement that is being classified as either fake or real, the author of the fact-check, the source of the statement, maybe the date, and last but not least the classification of the statement (true, false, etc.).

Image highlighting data to scrape

3. Use the inspect feature in your browser to identify where this data lives on the webpage. For example, right-click on the source (the Donald Trump link) and click Inspect. Notice that the text sits in the <div> tag with the class m-statement__meta. Within that tag is an <a> tag, and within that is the text 'Donald Trump' that we want to scrape and store in our data set as a source.

Image showing the inspect option

We will need to do a similar process for the rest of the data. The target, or classification of the statement, is slightly different: it lives within the rating image. More specifically, it is stored in the 'alt' attribute of the image. This is what tells us whether the statement is true or false.

Image showing the location of the target
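To make this concrete, here is a minimal sketch of pulling the rating out of the image's 'alt' attribute with BeautifulSoup. The HTML fragment below is illustrative only, a stripped-down imitation of PolitiFact's rating markup, not a copy of the live page:

```python
from bs4 import BeautifulSoup

#A simplified fragment mimicking PolitiFact's rating markup (illustrative only)
html = '''
<div class="m-statement__meter">
  <div class="c-image"><img src="meter.png" alt="false"></div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
#Drill down to the <img> inside the c-image div and read its 'alt' attribute
fact = soup.find('div', attrs={'class': 'c-image'}).find('img').get('alt')
print(fact)  # false
```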

4. Notice that this is only page 1 of the listing: https://www.politifact.com/factchecks/list/?page=1. This means there are likely other pages as well, and indeed there are. So we will loop through, say, 100 pages and scrape each one using the same steps above.

We can do this by simply adding the new number at the end of the link.
For example: https://www.politifact.com/factchecks/list/?page=45.

If you prefer not to read this article and would like a video version of it, you can check out the YouTube video below. It goes through everything in this article in a little more detail, and will help make it easy for you to start programming. Or you can use both as supplementary materials for learning!

Okay, now with all of that said, let's program!

Program

First, I like to start my program with a description.

#Description: This program scrapes FAKE and REAL news data from a website

Import the dependencies.

#Import the dependencies
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib.request
import time

Create lists to store the scraped data.

#Create lists to store the scraped data
authors = []
dates = []
statements = []
sources = []
targets = []

Create a function to scrape the website.

#Create a function to scrape the site
def scrape_website(page_number):
    page_num = str(page_number) #Convert the page number to a string
    URL = 'https://www.politifact.com/factchecks/list/?page='+page_num #Append the page number to complete the URL
    webpage = requests.get(URL) #Make a request to the website
    #time.sleep(3)
    soup = BeautifulSoup(webpage.text, "html.parser") #Parse the text from the website
    #Get each tag by its class
    statement_footer = soup.find_all('footer', attrs={'class':'m-statement__footer'}) #Author and date
    statement_quote = soup.find_all('div', attrs={'class':'m-statement__quote'}) #Statement
    statement_meta = soup.find_all('div', attrs={'class':'m-statement__meta'}) #Source
    target = soup.find_all('div', attrs={'class':'m-statement__meter'}) #Rating
    #Loop through the footer class m-statement__footer to get the date and author
    for i in statement_footer:
        link1 = i.text.strip()
        name_and_date = link1.split()
        first_name = name_and_date[1]
        last_name = name_and_date[2]
        full_name = first_name+' '+last_name
        month = name_and_date[4]
        day = name_and_date[5]
        year = name_and_date[6]
        date = month+' '+day+' '+year
        dates.append(date)
        authors.append(full_name)
    #Loop through the div m-statement__quote to get the statement
    for i in statement_quote:
        link2 = i.find_all('a')
        statements.append(link2[0].text.strip())
    #Loop through the div m-statement__meta to get the source
    for i in statement_meta:
        link3 = i.find_all('a') #Source
        source_text = link3[0].text.strip()
        sources.append(source_text)
    #Loop through the div m-statement__meter to get the rating of the statement (true, false, etc.)
    for i in target:
        fact = i.find('div', attrs={'class':'c-image'}).find('img').get('alt')
        targets.append(fact)

Loop through pages 1 to n-1 (here, 100 pages) to scrape the data.

#Loop through 'n-1' webpages to scrape the data
n = 101
for i in range(1, n):
    scrape_website(i)
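Notice that time was imported earlier and a time.sleep(3) call sits commented out in the function: pausing between requests is gentler on the site. Here is a sketch of one way to do that, where polite_loop is a hypothetical helper and scrape_fn stands in for the scrape_website function defined above:

```python
import time

def polite_loop(scrape_fn, n_pages, delay=2.0):
    #Call scrape_fn(page) for pages 1..n_pages, sleeping between requests
    for page in range(1, n_pages + 1):
        scrape_fn(page)
        time.sleep(delay)  #Pause so we don't hammer the server
```

Usage would then be `polite_loop(scrape_website, 100)` instead of the bare loop above.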

Create and show the data frame to store the data.

#Create a new dataFrame 
data = pd.DataFrame(columns = ['author', 'statement', 'source', 'date', 'target'])
data['author'] = authors
data['statement'] = statements
data['source'] = sources
data['date'] = dates
data['target'] = targets
#Show the data set
data

Create a function to convert the target data to a binary number.

#Create a function to get a binary number from the target
def getBinaryNumTarget(text):
    if text == 'true':
        return 1
    else:
        return 0

Create a function to get only the true and false values from the target.

#Create a function to get only true or false values from the target
def getBinaryTarget(text):
    if text == 'true':
        return 'REAL'
    else:
        return 'FAKE'

Store the data in the data frame.

#Store the data in the dataframe
data['BinaryTarget'] = data['target'].apply(getBinaryTarget)
data['BinaryNumTarget'] = data['target'].apply(getBinaryNumTarget)

Show the data.

#Show the data
data

Store the data in a CSV file.

#Store the data to a CSV file
data.to_csv('political_fact_checker.csv')

Thanks for reading this article! I hope it's helpful to you all. If you enjoyed this article and found it helpful, please leave some claps to show your appreciation. Keep up the learning, and if you like machine learning, mathematics, computer science, programming, or algorithm analysis, please visit and subscribe to my YouTube channels (randerson112358 & compsci112358).
