Scrape & Summarize News Articles Using Python


In this article I will show you how to collect and scrape news data from different sources in a unified way using the Python packages newspaper3k and nltk. The newspaper3k documentation covers all of the methods and information that you need to get started with the package, and the source code is available at https://github.com/codelucas/newspaper.

I’m a big advocate of not reinventing the wheel, and the newspaper3k package makes it easy to extract information from the web.

If you would rather watch than read, you can check out my YouTube video and the full code on my GitHub. The video goes through everything in this article in a little more detail, and will help you start programming in Python even if you don’t have it installed on your computer. You can also use both as supplementary materials for learning!

Programming:

Before writing a single line of code, I like to add a comment that describes what the program does. That way I can look back at my code later and know exactly what it is for.

You will need to install the packages newspaper3k and nltk.

Import the packages needed for this program.
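A minimal sketch of these first steps: the description comment, the install command, and the imports. The helper `missing_packages` and the constant `REQUIRED_MODULES` are hypothetical names added for illustration. The sketch assumes both packages are installed with pip; newspaper3k is imported under the module name `newspaper`.

```python
# Description: scrape a news article and summarize it
# using the packages newspaper3k and nltk.
#
# Install the packages first (a shell command, shown here as a comment):
#   pip install newspaper3k nltk

import importlib.util

# newspaper3k is imported under the module name "newspaper".
REQUIRED_MODULES = ["newspaper", "nltk"]


def missing_packages(module_names):
    """Return the names from `module_names` that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    if missing:
        print("Please install the missing modules:", ", ".join(missing))
    else:
        # The two imports the rest of the program relies on.
        import nltk
        from newspaper import Article
        print("newspaper3k and nltk are ready to use.")
```

Checking for the modules before importing them is optional; it only turns a missing install into a readable message.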

In this post we will scrape the Washington Post article titled “You downloaded FaceApp. Here’s what you’ve just done to your privacy”, which is about the app called FaceApp. The first thing we will do is get the article.

Once we have the article’s URL, we need to download the page’s HTML content, parse the article, download the sentence tokenizer, and extract keywords.
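These steps can be sketched as one function. This is a sketch under stated assumptions, not necessarily the exact code from the video: `prepare_article` is a hypothetical helper name, and the sentence tokenizer is assumed to be NLTK's `punkt` resource (some newer NLTK releases may expect a different resource name). `Article`, `download()`, `parse()`, and `nlp()` come from newspaper3k.

```python
def prepare_article(url):
    """Download, parse, and analyze the news article at `url`."""
    # Imported inside the function so the sketch can be loaded
    # even before the packages are installed.
    import nltk
    from newspaper import Article

    article = Article(url)   # nothing is fetched yet
    article.download()       # download the page's HTML content
    article.parse()          # extract authors, date, text, top image
    nltk.download("punkt")   # sentence tokenizer (assumed resource name)
    article.nlp()            # extract keywords and build the summary
    return article
```

Calling `prepare_article` with the URL of the Washington Post article returns an `Article` whose `authors`, `publish_date`, `top_image`, `text`, `keywords`, and `summary` attributes are filled in. It needs network access, both to fetch the page and to download the tokenizer the first time.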

Now everything is set up, and we can start extracting information. Let’s get the authors of the article.
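A small sketch of this step, assuming `article` is a newspaper3k `Article` that has already been downloaded and parsed. The `authors` attribute is newspaper3k's; the helper name `format_authors` and the fallback message are hypothetical.

```python
def format_authors(article):
    """Return the article's authors as one readable line."""
    # article.authors is a list of name strings filled in by parse().
    if not article.authors:
        return "No authors found"
    return ", ".join(article.authors)
```

Printing `article.authors` directly also works; the helper only adds a readable fallback for pages where no byline is detected.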

Output: a list of all of the authors found in the article

Next we will get the date that the article was published.
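One possible way to read and format the date, assuming `article` has already been parsed. In newspaper3k, `publish_date` holds a `datetime` when a date is detected and `None` otherwise; the helper name `format_publish_date` is hypothetical.

```python
def format_publish_date(article):
    """Return the publish date as text, for example 'July 17, 2019'."""
    published = article.publish_date  # a datetime, or None if not detected
    if published is None:
        return "Unknown publish date"
    # %B is the full month name; the day is written without zero padding.
    return f"{published:%B} {published.day}, {published.year}"
```

The month name follows the active locale, so it is only guaranteed to be in English when the locale has not been changed.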

Output: the article’s publish date, July 17, 2019

I also want to get the link to the article’s top image.
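A sketch of this step, assuming `article` has already been parsed. newspaper3k stores the link in the `top_image` attribute as a string; the helper name `top_image_url` is hypothetical, and returning `None` for an empty value is a choice made here for illustration.

```python
def top_image_url(article):
    """Return the article's main image URL, or None when none was found."""
    # top_image is a URL string; it is empty when no image was detected.
    return article.top_image or None
```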

Output: the image source, https://www.washingtonpost.com/resizer/JGS0Z2IB3PMcQr0zWU69TNxo0cQ=/1484x0/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg

Get the text from the article.
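The full text is available in the `text` attribute once the article has been parsed. Because articles can be long, the sketch below prints only a sample; the helper name `text_preview` and the 300-character default are hypothetical choices.

```python
def text_preview(article, max_chars=300):
    """Return the start of the article text, shortened to `max_chars`."""
    text = article.text.strip()
    if len(text) <= max_chars:
        return text
    # Cut at the limit and mark that the text continues.
    return text[:max_chars].rstrip() + "..."
```

Use `print(article.text)` instead if you want the whole article.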

Output: a sample of the article text

Finally we will summarize the article.
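A sketch of the last step, assuming `nlp()` has already been called on `article`, since that call is what fills in newspaper3k's `summary` and `keywords` attributes. The helper name `summary_report` and its output layout are hypothetical.

```python
def summary_report(article):
    """Return the summary and keywords produced by article.nlp() as text."""
    if not article.summary:
        return "No summary available (was nlp() called?)"
    lines = ["Summary:", article.summary]
    if article.keywords:
        lines.append("Keywords: " + ", ".join(article.keywords))
    return "\n".join(lines)
```

If you only need the summary itself, `print(article.summary)` is enough.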

Output: a summary of the article

Conclusion and Resources

That is it! You are done creating your program to scrape data from the web.

If you want to see the code, just go to my GitHub account and you can look at it there. Again, if you want, you can watch and listen to me explain all of the code in my YouTube video.

If you are interested in reading more about machine learning and want to get started right away with problems and examples, I recommend Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.

It is a great book for helping beginners learn to write machine-learning programs and understand machine-learning concepts.


Thanks for reading this article. I hope it’s helpful to you!

Other resources

  1. newspaper3k documentation
  2. newspaper3k GitHub: https://github.com/codelucas/newspaper
  3. You downloaded FaceApp. Here’s what you’ve just done to your privacy
  4. Scrape and summarize News Articles in 5 lines of Python Code
