How to obtain a list of titles of all Wikipedia articles
Problem description
I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki. One would be the API and the other would be a database dump.
I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even if there is one, it would need more than 4 million requests, which would probably get me blocked from making any further requests anyway.
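So my questions are:
- Is there a way to obtain only the titles of Wikipedia articles via the API?
- Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?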
Recommended answer
The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
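As a rough illustration of the API approach, here is a minimal Python sketch that pages through allpages and follows the continuation values returned with each batch. The endpoint URL (English Wikipedia), the User-Agent string, and the function names fetch_json and iter_titles are assumptions made for this example, not part of the original answer.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia endpoint; other wikis use their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with something that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "title-lister-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_titles(fetch=fetch_json, namespace=0):
    """Yield page titles one by one, requesting a new batch when needed."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,  # 0 is the main (article) namespace
        "aplimit": "max",          # the largest batch the API will allow
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break  # no continuation block means this was the last batch
        # Carry the continuation values over into the next request.
        params = {**params, **data["continue"]}
```

Because iter_titles is a generator, wrapping it in itertools.islice(iter_titles(), 10) should trigger only a single request. The fetch argument is there so that the paging logic can be exercised with a stand-in function instead of the live API.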
But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
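No database is needed to use that dump, since it is plain text. The sketch below reads such a file with Python's gzip module. It assumes the file holds one title per line, that the first line may be a page_title column header, and that titles are stored with underscores in place of spaces. The function name read_titles and the file name in the usage note are illustrative assumptions.

```python
import gzip


def read_titles(path, header="page_title"):
    """Yield titles from a gzipped text file with one title per line."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            title = line.rstrip("\n")
            # Assumption: the first line may be a column header, not a title.
            if line_number == 0 and title == header:
                continue
            if title:
                # Assumption: underscores stand in for spaces in stored titles.
                yield title.replace("_", " ")
```

For example, sum(1 for _ in read_titles("enwiki-latest-all-titles-in-ns0.gz")) would count the titles without loading the whole file into memory, assuming the dump has been downloaded under that name.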