Newspaper3k: A Python Library for Article Parsing

Since the 2016 election, news has been a hot topic in the United States. Fake news, fringe sites, the Executive Office, and hackers have all become strangely intertwined. Sites like The Goldwater and Mediaite, for example, were thrust into the world's newsfeed after years as little-known internet entities.

Wikileaks, a self-described news publisher, capitalizes on the fact that computer hackers and news publishers go hand in hand. So it was only a matter of time before someone wrote a library to ease the process of article aggregation.

Newspaper3k is an open source library that lets you pass in almost any article URL and auto-detects things like the text, authors, and title. Previously, with a library like BeautifulSoup, you had to specify the unique identifiers for each webpage yourself. With Newspaper3k, you just download and parse the URL, and the data comes back in easily retrievable attributes.
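To make that contrast concrete, here is roughly what site-specific scraping looks like with BeautifulSoup alone. The URL and the h1 selector below are hypothetical, and the selector would have to be rewritten for every site whose markup differs:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com/some-news-story.html').text
soup = BeautifulSoup(html, 'html.parser')

# You have to know this site's markup in advance: the selector below only
# works if the headline happens to live in an <h1 class="headline"> tag.
title = soup.find('h1', attrs={'class': 'headline'}).get_text()
print(title)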

To use the library, you must first install it with pip3. Once it is installed, you can import it into a Python program like any other library.

pip3 install newspaper3k
How to install newspaper3k
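
One extra setup step worth noting: the summary feature used later in this post (Newspaper3k's nlp() method) relies on NLTK under the hood, so on a fresh install you may also need to download NLTK's punkt tokenizer data once:

python3 -c "import nltk; nltk.download('punkt')"
Downloading the NLTK data that nlp() depends on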

Next, to use the library, just pass in a URL and call the methods described in the README file. These methods return all kinds of reliable, clean data in an easy-to-use fashion. The library takes the grunt work out of article parsing: no more writing one set of methods for CNN.com and another set for NYTimes.com. Now all you have to do is pass in the URL, and the library takes care of the rest.

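Here is a minimal sketch of that workflow; the URL is a placeholder, and the nlp() call at the end is what fills in the summary and keywords:

from newspaper import Article

# Any article URL will do; this one is a placeholder.
url = 'https://example.com/some-news-story.html'

article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # auto-detect title, authors, text, and more

print(article.title)
print(article.authors)
print(article.text)

article.nlp()  # fills in summary and keywords (needs the NLTK data above)
print(article.summary)
print(article.keywords)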
Some examples of how to use newspaper3k

I've made a script that visits 8ch.net and parses the news links at the top of the site. The program uses BeautifulSoup to collect the URLs, then passes each one to Newspaper3k.

from bs4 import BeautifulSoup
from newspaper import Article
import requests

def use_paper_module(url):
    # Download and parse the article, then package the fields for the POST below.
    turn_to_json = {}
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()  # fills in article.summary; needs NLTK's punkt data (see above)
    turn_to_json['title'] = article.title
    turn_to_json['text'] = article.text
    turn_to_json['summary'] = article.summary
    turn_to_json['submit'] = 'submit'
    return turn_to_json

r = requests.get("https://8ch.net/index.html")
soup = BeautifulSoup(r.text, "html.parser")
divs = soup.find_all('div', attrs={'class': 'col-6'})
urls = []

for div in divs:
    links = div.find_all('a')
    for a in links:
        # Keep only external story links; skip 8ch.net-internal and archive links.
        href = a.get('href')
        if href and '8ch.net' not in href and 'archive' not in href:
            urls.append(href)

for link in urls:
    # Parse each story with Newspaper3k, then POST the fields to the catcher script.
    jason = use_paper_module(link)
    r = requests.post("http://example.com/story_catch.php", data=jason)
    print(r.text)
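
One design note: data=jason sends the fields form-encoded, which is what a PHP script reading $_POST (like the story_catch.php endpoint above) typically expects. If the receiving endpoint wanted a JSON body instead, the standard requests alternative is:

r = requests.post("http://example.com/story_catch.php", json=jason)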
