python querying wikipedia performance


Question

I need to query Wikipedia for just one very particular purpose, that is, to get the text for a given URL. To be a little more precise:

I have about 14,000 Wikipedia URLs from the English corpus and I need to get the text, or at least the introduction, of each of these URLs. My further processing will be in Python, so this would be the language of choice.

I am searching for the method with the best performance and came up with 4 different approaches:

  1. get the xml dump and parse directly via python
    -> further question here: how do I query the xml file, knowing the url?
  2. get the xml, set up the database and query sql with python
    -> further question here: how do I query sql, knowing the url?
  3. use the wikipedia api and query it directly via python
  4. just crawl these wikipedia pages (which is maybe kind of sneaky and as well annoying because it's html and not plain text)

Which method should I use, i.e. which method has the best performance and is somehow standard?

Answer

Some thoughts:

I have about 14,000 Wikipedia URLs from the English corpus and I need to get the text, or at least the introduction, of each of these URLs.
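Whichever approach ends up being used, the API works with page titles rather than full URLs, so the 14,000 URLs would first need to be mapped to titles. A minimal sketch of that step (the helper name and the assumption that the URLs follow the usual /wiki/Title pattern are mine, not part of the original answer):

from urllib.parse import unquote, urlparse

def url_to_title(url):
    """Turn e.g. 'http://en.wikipedia.org/wiki/History_of_Python' into 'History of Python'."""
    path = urlparse(url).path                # '/wiki/History_of_Python'
    title = path.split("/wiki/", 1)[-1]      # 'History_of_Python'
    return unquote(title).replace("_", " ")  # decode %xx escapes, underscores become spaces

print(url_to_title("http://en.wikipedia.org/wiki/Guido_van_Rossum"))  # Guido van Rossum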

1 - get the xml dump and parse directly via python

There are currently 4,140,640 articles in the English Wikipedia. You're interested in 14,000 articles, or about one third of one percent of the total. That sounds too sparse for dumping all the articles to be the best approach.

2 - get the xml, set up the database and query sql with python

Do you expect the set of articles you're interested in to grow or change? If you need to respond rapidly to changes in your set of articles, a local database may be useful. But you'll have to keep it up to date. It's simpler to get the live data using the API, if that's fast enough.

4 - just crawl these wikipedia pages (which is maybe kind of sneaky and as well annoying because it's html and not plain text)

If you can get what you need out of the API, that will be better than scraping the Wikipedia site.

3 - use the wikipedia api and query it directly via python

Based on the low percentage of articles that you're interested in, about 0.338%, this is probably the best approach.

Be sure to check out the MediaWiki API documentation and the API reference. There's also the python-wikitools module.

I need to get the text, or at least the introduction

If you really only need the intro, that will save a lot of traffic and makes using the API the best choice, by far.

There are a variety of ways to retrieve the introduction; here's one good way:

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&format=xml&titles=Python_(programming_language)
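The same query can also be issued from Python. Here is a minimal sketch using the third-party requests library and format=json instead of the XML shown above (the function name, User-Agent string, and explaintext flag are my own choices, not from the original answer):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
# A descriptive User-Agent is recommended by the Wikimedia API etiquette guidelines.
HEADERS = {"User-Agent": "wiki-intro-fetcher/0.1 (example; add your contact info)"}

def fetch_intro(title):
    """Fetch the plain-text introduction of one article via prop=extracts."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the text before the first section heading
        "explaintext": 1,   # plain text instead of limited HTML
        "format": "json",
        "titles": title,
    }
    response = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    # The result is keyed by page ID; with a single title there is exactly one entry.
    return next(iter(pages.values())).get("extract", "")

print(fetch_intro("Python (programming language)")[:200])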

If you have many requests to process at a time, you can batch them in groups of up to 20 articles:

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&exlimit=20&format=xml&titles=Python_(programming_language)|History_of_Python|Guido_van_Rossum

This way you can retrieve your 14,000 article introductions in 700 round trips.
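A sketch of that batching loop, issuing the requests serially as the etiquette guidelines quoted below recommend (the helper name and the lack of full continuation handling are assumptions on my part, not part of the original answer):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "wiki-intro-fetcher/0.1 (example; add your contact info)"}

def fetch_intros(titles, batch_size=20):
    """Yield (title, intro) pairs, requesting up to 20 articles per round trip."""
    session = requests.Session()
    session.headers.update(HEADERS)
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        params = {
            "action": "query",
            "prop": "extracts",
            "exintro": 1,
            "explaintext": 1,
            "exlimit": batch_size,   # at most 20 extracts per request
            "format": "json",
            "titles": "|".join(batch),
        }
        response = session.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        # Note: a "continue" element may appear if not everything fit into one
        # response; a production version should follow it until it is gone.
        for page in data["query"]["pages"].values():
            yield page.get("title"), page.get("extract", "")

# Usage with a few example titles (your 14,000 titles would go here):
for title, intro in fetch_intros(["History of Python", "Guido van Rossum"]):
    print(title, "->", (intro or "")[:80])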

Note: the API reference documentation for exlimit states:

No more than 20 (20 for bots) allowed

Also note: the API documentation section on Etiquette and usage limits says:

If you make your requests in series rather than in parallel (i.e. wait for one request to finish before sending a new request, such that you're never making more than one request at the same time), then you should definitely be fine. Also try to combine things into one request where you can (e.g. use multiple titles in a titles parameter instead of making a new request for each title).

Wikipedia is constantly updated. If you ever need to refresh your data, tracking revision IDs and timestamps will enable you to identify which of your local articles are stale. You can retrieve revision information (along with the intro, here with multiple articles) using (for example):

http://en.wikipedia.org/w/api.php?action=query&prop=revisions|extracts&exintro&exlimit=20&rvprop=ids|timestamp&format=xml&titles=Python_(programming_language)|History_of_Python|Guido_van_Rossum
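As a sketch of how that revision information might be consumed from Python (again using requests and format=json; the function name and the staleness bookkeeping are my own assumptions, not part of the original answer):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "wiki-intro-fetcher/0.1 (example; add your contact info)"}

def fetch_intros_with_revisions(titles):
    """Return {title: (revid, timestamp, intro)} for up to 20 titles."""
    params = {
        "action": "query",
        "prop": "revisions|extracts",
        "exintro": 1,
        "explaintext": 1,
        "exlimit": 20,
        "rvprop": "ids|timestamp",   # latest revision ID and its timestamp
        "format": "json",
        "titles": "|".join(titles),
    }
    response = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
    response.raise_for_status()
    result = {}
    for page in response.json()["query"]["pages"].values():
        rev = page.get("revisions", [{}])[0]   # only the latest revision is returned per page
        result[page["title"]] = (rev.get("revid"), rev.get("timestamp"), page.get("extract", ""))
    return result

info = fetch_intros_with_revisions(["History of Python", "Guido van Rossum"])
# A locally stored article is stale if its saved revid differs from the one returned here.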

