Python: Newspaper Module - Any way to pool getting articles straight from URLs?


Question

I'm using the Newspaper module for Python, found here.

In the tutorials, it describes how you can pool the building of different newspapers so that it generates them at the same time (see "Multi-threading article downloads" in the link above).

Is there any way to do this for pulling articles straight from a list of URLs? That is, is there any way I can feed multiple URLs into the following set-up and have it download and parse them concurrently?

from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])

Answer

I was able to do this by creating a Source for each article URL. (Disclaimer: I'm not a Python developer.)

import newspaper

urls = [
  'http://www.baltimorenews.net/index.php/sid/234363921',
  'http://www.baltimorenews.net/index.php/sid/234323971',
  'http://www.atlantanews.net/index.php/sid/234323891',
  'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        # Initialize with a dummy source URL; only the single article matters.
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()

for s in sources:
    print(s.articles[0].html)

