How To Scrape Websites Using Python

Scrape web links

Python is a general-purpose programming language that has become increasingly popular for data analytics, data science, and machine learning, largely due to its many libraries and packages. When doing data analysis, there will be times when the data you need is only available on a web page. In cases like this, you can use a technique called web scraping to gather the data for your analysis. Let’s get started scraping using Python 3 and the BeautifulSoup library!

Step 0: Install Python version 3

Before we begin, you must have Python version 3 installed. You can download it here.

Step 1: Choose a website that you want to scrape.

I chose a test page on my blog, everythingcomputerscience.com/test/simpleHTML.html. We need a page that contains links for us to scrape.

Step 2: Install Python packages (requests & BeautifulSoup)

We need the requests and BeautifulSoup packages to do the scraping. The requests package sends HTTP/1.1 requests, and the BeautifulSoup package pulls data out of the HTML that comes back. Use the following commands:

pip install beautifulsoup4
pip install requests
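
To confirm that both installs worked before moving on, a quick check like the following can help. It is a minimal sketch that uses only the standard library, and the helper name check_packages is made up for this illustration. Note that BeautifulSoup is installed as beautifulsoup4 but imported as bs4.

```python
import importlib.util


def check_packages(names=("bs4", "requests")):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report which of the two packages are importable in this environment.
for name, found in check_packages().items():
    print(f"{name}: {'installed' if found else 'missing, try pip install again'}")
```

If either package is reported as missing, re-run the matching pip command using the same Python installation that you will use to run the script.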

Step 3: Create a Python file.

I called my Python file “scrape.py”. You can create it as follows:
i) Open Notepad or any other text editor on your computer
ii) Click File → Save As

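iii) Set “Save as type” to All Files (*.*) so that Notepad does not append .txt to the file name
NOTE: Remember where you saved this file; you will need its location in Step 5.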

Step 4: Create the scraping code

In the Python file (scrape.py), we will insert the code that does the scraping.

i) Import the two packages (BeautifulSoup and requests)

from bs4 import BeautifulSoup
import requests

ii) Ask the user for the input URL to scrape the data from

url = input("Enter a website to extract the links from: ")

iii) Request the page from the server using an HTTP GET request

r = requests.get(url)

iv) Convert the raw response to text to retrieve the data

data = r.text
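
Real requests can fail or hang, so it can be worth guarding this step. The sketch below wraps steps iii and iv in a helper (fetch_html is a hypothetical name, and the 10-second timeout is an arbitrary choice) that gives up after a timeout and raises an error for responses such as 404, rather than handing back an error page as if it were data.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

A caller can wrap fetch_html(url) in try/except requests.RequestException to print a friendly message when the URL is mistyped or the server is unreachable.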

v) Use Python’s built-in HTML parser to pull data out of the HTML

soup = BeautifulSoup(data, 'html.parser')
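
To see what the parser gives you without touching the network, you can feed BeautifulSoup a small HTML string. The markup below is invented for this illustration and is not the contents of the test page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="http://example.com/first">First link</a>
    <a href="http://example.com/second">Second link</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string   # text inside the <title> tag
first_anchor = soup.find("a")    # first <a> tag in the document

print(page_title)
print(first_anchor.get("href"), first_anchor.get_text())
```

The soup object lets you look up tags by name and read their attributes and text, which is what the remaining steps rely on.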

vi) Create an empty list to store the links in

links = []

vii) Get the href attribute from every <a> tag and store it in the links variable

for link in soup.find_all('a'):
    href = link.get('href')
    if href:  # skip <a> tags that have no href attribute
        links.append(href)

viii) Print the list, a.k.a. the links, one per line

print('\n'.join(links))

NOTE: Don’t forget to save this file.
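
Putting the steps together, the parsing part can be wrapped in a function so it is easy to reuse and test. This is a sketch rather than the code from the GitHub repository: extract_links is a hypothetical name, and resolving relative links with urljoin is an addition that the steps above do not perform.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute URL of every <a href=...> in html, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a"):
        href = anchor.get("href")
        if href:  # skip anchors without an href attribute
            # Turn relative links such as "/about" into full URLs.
            links.append(urljoin(base_url, href))
    return links


# Made-up HTML and base URL, used only to demonstrate the function.
demo_html = (
    '<a href="/about">About</a>'
    '<a name="top">No href</a>'
    '<a href="http://example.com/x">X</a>'
)
print("\n".join(extract_links(demo_html, "http://example.com/blog/")))
```

Keeping the parsing separate from the download means you can try it on a saved HTML string without making a request each time.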

You can get the actual code from my GitHub.

Step 5: Run the program

Now that we have finished all of the above steps, it’s time to actually run the code!

1. Open Command Prompt on Windows, or a terminal on macOS or Linux.

2. Run your Python script “scrape.py” by typing “python” followed by the path to where you saved it. For example:

C:\Users\randerson112358>python C:\Users\randerson112358\Desktop\scrape.py

3. If everything is correct then you should be prompted to enter the website URL.

Enter a website to extract the links from:
http://everythingcomputerscience.com/test/simpleHTML.html

4. Results

http://www.yahoo.com
http://www.yahoo.com
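
The output above contains the same address twice, which happens whenever a page has more than one <a> tag pointing to the same URL. If you only want each link once, one option is to remove duplicates while keeping the original order. The helper name unique_links is an assumption for this sketch.

```python
def unique_links(links):
    """Remove duplicate links while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(links))


scraped = ["http://www.yahoo.com", "http://www.yahoo.com"]
print(unique_links(scraped))
```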

Step 6: Here is a video showing the steps 0 to 5

Thanks for reading this article. I hope it’s helpful to you all! If you would like more computer science and algorithm analysis videos, please visit and subscribe to my YouTube channels (randerson112358 & compsci112358).

Check out the following for content and videos on computer science, algorithm analysis, programming, and logic:

YouTube Channel:
randerson112358: https://www.youtube.com/channel/UCaV_0qp2NZd319K4_K8Z5SQ

compsci112358:
https://www.youtube.com/channel/UCbmb5IoBtHZTpYZCDBOC1CA

Website:
http://everythingcomputerscience.com/

Video Tutorials on Recurrence Relation:
https://www.udemy.com/recurrence-relation-made-easy/

Video Tutorial on Algorithm Analysis:
https://www.udemy.com/algorithm-analysis/

Twitter:
https://twitter.com/CsEverything

