Python: Newspaper Module - Any way to pool getting articles straight from URLs?

Question

I'm using the Newspaper module for Python, found here.

In the tutorial, it describes how you can pool the building of different newspapers so that they are generated at the same time (see "Multi-threading article downloads" in the link above).
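
For reference, the tutorial's pooled approach looks roughly like this (a minimal sketch based on newspaper's documented news_pool API; the example domains are placeholders):

import newspaper

# Build a full Source object per site (placeholder domains)
papers = [newspaper.build(u) for u in ['http://slate.com', 'http://cnn.com']]

# Download every source's articles concurrently
newspaper.news_pool.set(papers, threads_per_source=2)
newspaper.news_pool.join()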

Is there any way to do this for pulling articles straight from a list of URLs? That is, can I feed multiple URLs into the following setup and have them downloaded and parsed concurrently?

from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])

Answer

I was able to do this by creating a Source for each article URL. (Disclaimer: I'm not a Python developer.)

import newspaper

urls = [
  'http://www.baltimorenews.net/index.php/sid/234363921',
  'http://www.baltimorenews.net/index.php/sid/234323971',
  'http://www.atlantanews.net/index.php/sid/234323891',
  'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        # Initialize the Source with a placeholder URL; only the article list matters here
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()

for s in sources:
    print(s.articles[0].html)
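
To get parsed text rather than raw HTML, each article can be parsed after the pooled download finishes, since the pool only downloads (a small follow-up sketch under that assumption):

for s in sources:
    a = s.articles[0]
    a.parse()               # parse the downloaded HTML to populate a.text, a.title, etc.
    print(a.text[:150])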
