Scrape web links
Python is a general-purpose programming language that is increasingly popular for data analytics, data science, and machine learning. It has many capabilities due to its many libraries and packages. When doing data analytics or data science tasks, there will be times when you want to use data from the internet. That data may only be available on a webpage, and in cases like this you can use a technique called web scraping to gather the data for your analysis. Let’s get started scraping using Python 3 and the library BeautifulSoup!
Step 0: Install Python version 3
Before we begin, you must have Python version 3 installed. You can download it here.
Step 1: Choose a website that you want to scrape
I chose a page from my blog, everythingcomputerscience.com/test/simpleHTML.html. We need a page that has links we want to scrape.
Step 2: Install Python packages (requests & BeautifulSoup)
We need the requests and BeautifulSoup packages to do the scraping: requests sends the HTTP requests, and BeautifulSoup pulls data out of the HTML files. Use the following commands:
pip install beautifulsoup4
pip install requests
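To confirm that both installs worked before writing the scraper, a small check like the one below can help. It is only a sketch: the helper name check_packages is made up for this example, and it relies on the fact that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util

# Import names differ from pip names: beautifulsoup4 is imported as "bs4".
REQUIRED = {"bs4": "beautifulsoup4", "requests": "requests"}


def check_packages(required=REQUIRED):
    """Map each import name to True if Python can find it, else False.

    check_packages is a hypothetical helper written for this example.
    """
    return {name: importlib.util.find_spec(name) is not None for name in required}


# Report each package, with an install hint for anything that is missing.
for name, found in check_packages().items():
    status = "installed" if found else f"missing (try: pip install {REQUIRED[name]})"
    print(f"{name}: {status}")
```

If either line reports "missing", rerun the matching pip command with the same Python installation you plan to use for the script.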
Step 3: Create a Python file
I called my Python file “scrape.py”. You can create it by
i) Opening Notepad or any other text editor on your computer
ii) Clicking File → Save As
iii) Setting “Save as type” to All Files (*.*) and naming the file scrape.py
NOTE: Remember where you saved this file
Step 4: Create the scraping code
In the Python file (scrape.py), we will insert the code that does the scraping.
i) Import the two packages (BeautifulSoup and requests)
from bs4 import BeautifulSoup
import requests
ii) Ask the user for the input URL to scrape the data from
url = input("Enter a website to extract the links from: ")
iii) Request the page from the server using an HTTP GET request
r = requests.get(url)
iv) Read the response body as text
data = r.text
v) Use Python’s built-in HTML parser to pull data out of the HTML
soup = BeautifulSoup(data, 'html.parser')
vi) Create an empty list to store the links in
links = []
vii) Get the href attribute of every <a> tag that has one, and store it in the links list
for link in soup.find_all('a', href=True):
    links.append(link['href'])
viii) Print the links, one per line
print('\n'.join(links))
NOTE: Don’t forget to save this file.
You can get the actual code from my GitHub.
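To try the parsing steps on their own, without a network connection, the sketch below runs them against a small HTML string. The sample page and the function name extract_links are made up for illustration and are not the author's GitHub code. It also shows one detail worth knowing: an <a> tag with no href makes link.get('href') return None, so the example asks find_all for tags that actually have the attribute.

```python
from bs4 import BeautifulSoup

# Made-up sample page standing in for r.text; not taken from the article's site.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/first">First</a>
  <a href="/second.html">Second</a>
  <a name="top">Anchor without href</a>
</body></html>
"""


def extract_links(html):
    """Return the href value of every <a> tag that has one.

    extract_links is a hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that carry an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


print("\n".join(extract_links(SAMPLE_HTML)))
```

This prints the two href values and skips the third tag. Relative links such as /second.html come back exactly as written in the page.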
Step 5: Run the program
Now that we have finished all of the above steps, it’s time to actually run the code!
1. Open Command Prompt on your Windows computer, or a terminal on macOS or Linux.
2. Navigate to the folder where you saved your Python script “scrape.py” and run it. Don’t forget to put “python” before the file name. For example:
python scrape.py
3. If everything is correct, you should be prompted to enter the website URL:
Enter a website to extract the links from:
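One thing to watch at this prompt: requests needs a full URL including the scheme, so an address typed as everythingcomputerscience.com/test/simpleHTML.html fails with a MissingSchema error. A minimal sketch of one way to handle that is below. The helper name normalize_url and the choice of http as the default scheme are assumptions for this example, not part of the original script.

```python
def normalize_url(raw, default_scheme="http"):
    """Return the URL with a scheme prepended if the user left it out.

    normalize_url is a hypothetical helper; the default scheme is an assumption.
    """
    url = raw.strip()
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    return url


print(normalize_url("everythingcomputerscience.com/test/simpleHTML.html"))
# prints: http://everythingcomputerscience.com/test/simpleHTML.html
```

In scrape.py this could wrap the input call, for example url = normalize_url(input(...)), before the URL is passed to requests.get.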
Step 6: Here is a video showing the steps 0 to 5
Thanks for reading this article. I hope it’s helpful to you all! If you would like more computer science and algorithm analysis videos, please visit and subscribe to my YouTube channels (randerson112358 & compsci112358).