In this article I will show you how to collect and scrape news data from different sources in a unified way using the Python packages newspaper3k and nltk. The newspaper3k documentation covers all of the methods and information you need to get started with the package. You can see the source code at https://github.com/codelucas/newspaper.
I’m a big advocate of not reinventing the wheel, and the newspaper3k package makes it simple to extract information from the web.
If you prefer not to read this article and would like a video version of it, you can check out the YouTube video below and the full code on my GitHub. It goes through everything in this article in a little more detail, and will help you start programming in Python even if you don’t have it installed on your computer. Or you can use both as supplementary materials for learning!
The first thing that I like to do before writing a single line of code is to put a description in comments of what the code does. That way I can look back at my code and know exactly what it does.
#Description: Scrape and Summarize News Articles
You will need to install the packages newspaper3k and nltk.
pip install nltk
pip install newspaper3k
Import the packages needed for this program.
#Import the libraries
import nltk
from newspaper import Article
In this post we will scrape the article from The Washington Post titled “You downloaded FaceApp. Here’s what you’ve just done to your privacy,” which is about the app called FaceApp. So the first thing that we will do is get the article.
#Get the article
url = 'https://www.washingtonpost.com/technology/2019/07/17/you-downloaded-faceapp-heres-what-youve-just-done-your-privacy/?noredirect=on&utm_term=.1938589d078f'
article = Article(url)
Once we have the article’s URL, we need to download the page’s HTML content, parse the article, download the sentence tokenizer, and extract keywords.
# Do some NLP
article.download() #Downloads the link’s HTML content
article.parse() #Parse the article
nltk.download('punkt') #1-time download of the sentence tokenizer
article.nlp() #Keyword extraction wrapper
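Note that article.download() can fail if the link is dead or the site blocks the request. Here is a minimal sketch of a guard around that step, assuming fetch_article (a hypothetical helper name of my choosing) and the ArticleException error type that newspaper3k exposes:
#Guard the download/parse step (a sketch; fetch_article is a hypothetical name)
from newspaper import Article, ArticleException

def fetch_article(url):
    article = Article(url)
    try:
        article.download() #Raises if the link is dead or the request is blocked
        article.parse()
    except ArticleException as error:
        print('Could not fetch', url, '-', error)
        return None
    return article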
Now we have everything set up for us to start using some of the methods to extract information. Let’s get the authors of the article.
#Get the authors
print(article.authors)
Next we will get the date that the article was published.
#Get the publish date
print(article.publish_date)
I also want to get the link to the article’s top image.
#Get the top image
print(article.top_image)
Get the text from the article.
#Get the article text
print(article.text)
Finally, we will summarize the article.
#Get a summary of the article
print(article.summary)
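Since article.nlp() is the keyword-extraction wrapper we ran earlier, you can optionally print the keywords it found as well:
#Get the keywords (populated by article.nlp())
print(article.keywords)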
Conclusion and Resources
That is it! You are done creating your program to scrape data from the web.
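For convenience, here is the whole program in one place, a minimal sketch that simply assembles the exact steps above (swap in any article URL you like):
#Description: Scrape and Summarize News Articles
import nltk
from newspaper import Article

#Get the article
url = 'https://www.washingtonpost.com/technology/2019/07/17/you-downloaded-faceapp-heres-what-youve-just-done-your-privacy/?noredirect=on&utm_term=.1938589d078f'
article = Article(url)

# Do some NLP
article.download() #Downloads the link’s HTML content
article.parse() #Parse the article
nltk.download('punkt') #1-time download of the sentence tokenizer
article.nlp() #Keyword extraction wrapper

#Print everything we extracted
print(article.authors) #The authors
print(article.publish_date) #The publish date
print(article.top_image) #The top image link
print(article.text) #The article text
print(article.summary) #A summary of the article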
If you are interested in reading more about machine learning and want to get started right away with problems and examples, I recommend you read Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.
It is a great book for helping beginners learn to write machine-learning programs and understand machine-learning concepts.
Thanks for reading this article; I hope it’s helpful to you!