How to obtain a list of titles of all Wikipedia articles

Question

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki: one is the API, the other is a database dump.

I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles. Even if there is one, it would need more than 4 million requests, which would probably get me blocked from any further requests anyway.

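So my questions are:

  1. Is it possible to obtain only the titles of Wikipedia articles through the API?
  2. Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?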

Recommended answer

The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
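As a rough illustration of that approach, the following Python sketch pages through list=allpages in the article namespace (0) and follows the continue values the API returns with each batch. The endpoint URL (English Wikipedia), the User-Agent string, and the helper names fetch_json and iter_titles are assumptions made for this example; they are not part of the original answer.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; other wikis use the same path on their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value is a placeholder; replace it with your own contact info.
    request = urllib.request.Request(
        url, headers={"User-Agent": "title-lister-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_titles(fetch=fetch_json):
    """Yield article titles one by one, requesting a new batch when needed."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": 0,   # namespace 0 holds the articles
        "aplimit": "max",   # 500 per request for ordinary users
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break  # no more batches
        # Merge the continuation values (e.g. apcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Because iter_titles is a generator, taking only the first few titles (for example with itertools.islice) triggers a single request. Walking the whole list still means roughly 9000 sequential requests, so a pause between batches would be a polite addition.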

But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
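Reading such a file needs no database at all. The sketch below assumes the dump has already been downloaded to a local path, that it is gzipped UTF-8 text with one title per line, that titles use underscores in place of spaces, and that the first line may be a page_title header. The function name read_titles and the file name in the usage note are hypothetical.

```python
import gzip


def read_titles(path):
    """Yield article titles from a gzipped all-titles-in-ns0 style file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            title = line.rstrip("\n")
            # Assumption: the file may start with a "page_title" header line.
            if line_number == 0 and title == "page_title":
                continue
            if title:
                # Assumption: titles are stored with underscores for spaces.
                yield title.replace("_", " ")
```

Calling read_titles("all-titles-in-ns0.gz") streams the titles line by line, so the whole list never has to be held in memory at once.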
