How to get page ID from Wikipedia page title


Question

I am trying to find the page IDs for a list of Wikipedia pages. The format is:

input: list of Wikipedia page titles

output: list of Wikipedia page IDs

So far, I've gone through the MediaWiki API documentation to understand how to proceed, but couldn't find a correct way to implement this. Can anyone suggest how to get the list of page IDs?

Recommended answer

Query the basic page info (prop=info) for all titles in a single request:

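import requests

page_titles = ['A', 'B', 'C', 'D']

# Pass the query as params so requests URL-encodes the titles
# (spaces, '&', non-ASCII characters) instead of concatenating them raw.
params = {
    'action': 'query',
    'prop': 'info',
    'titles': '|'.join(page_titles),  # multiple titles are pipe-separated
    'format': 'json',
}
json_response = requests.get(
    'https://en.wikipedia.org/w/api.php', params=params).json()

# query.pages is keyed by page ID (as a string); map each title to its ID.
title_to_page_id = {
    page_info['title']: page_id
    for page_id, page_info in json_response['query']['pages'].items()}

print(title_to_page_id)
print([title_to_page_id[title] for title in page_titles])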

This prints:

{'A': '290', 'B': '34635826', 'C': '5200013', 'D': '8123'}
['290', '34635826', '5200013', '8123']
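
The lookup by title works here because 'A' through 'D' are already in canonical form. The API normalizes titles (for example 'a' becomes 'A') and reports each rewrite in query.normalized, and a page that does not exist comes back under a negative key with a 'missing' field and no 'pageid'. A minimal sketch of handling both cases follows; the function name map_titles_to_ids and the sample response are illustrative assumptions, not part of the original answer.

```python
def map_titles_to_ids(query, titles):
    """Map each requested title to its page ID, or None if no page exists."""
    # Requested title -> normalized title, for titles the API rewrote.
    normalized = {n['from']: n['to'] for n in query.get('normalized', [])}
    # Normalized title -> page ID; missing pages carry no 'pageid' field.
    ids_by_title = {
        page['title']: page['pageid']
        for page in query['pages'].values()
        if 'pageid' in page
    }
    return {t: ids_by_title.get(normalized.get(t, t)) for t in titles}


# Hand-written sample shaped like the 'query' object of a response
# (illustrative only, not captured from a live request).
sample_query = {
    'normalized': [{'from': 'a', 'to': 'A'}],
    'pages': {
        '290': {'pageid': 290, 'ns': 0, 'title': 'A'},
        '-1': {'ns': 0, 'title': 'No such page xyz', 'missing': ''},
    },
}

print(map_titles_to_ids(sample_query, ['a', 'No such page xyz']))
# {'a': 290, 'No such page xyz': None}
```

Note that this sketch returns the integer 'pageid' field, whereas the code above returns the string keys of query.pages.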

If you have many titles, you have to split them across multiple requests, because at most 50 titles (500 for bots) can be queried at once.
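
One possible way to respect that limit is to split the title list into batches of 50 and merge the results. The sketch below is an assumption-laden illustration rather than the answer's own code: the helper names (chunked, fetch_pages, get_page_ids) are hypothetical, the User-Agent string is a placeholder, and it uses only the standard library so it runs without extra dependencies (the same request can be made with requests as above).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = 'https://en.wikipedia.org/w/api.php'
BATCH_SIZE = 50  # per-request title limit for non-bot clients


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_pages(titles):
    """Request one batch of titles and return the response's `pages` mapping."""
    query = urlencode({
        'action': 'query',
        'prop': 'info',
        'titles': '|'.join(titles),
        'format': 'json',
    })
    # Placeholder User-Agent: replace with one that identifies your own tool.
    request = Request(API_URL + '?' + query,
                      headers={'User-Agent': 'page-id-example/0.1'})
    with urlopen(request, timeout=30) as response:
        return json.load(response)['query']['pages']


def get_page_ids(titles, fetch=fetch_pages):
    """Return {title: page_id} for all titles, querying in batches."""
    title_to_page_id = {}
    for batch in chunked(titles, BATCH_SIZE):
        for page_id, page_info in fetch(batch).items():
            title_to_page_id[page_info['title']] = page_id
    return title_to_page_id
```

Calling get_page_ids(page_titles) with a list of any length then issues one request per 50 titles. The fetch parameter exists only so the batching logic can be exercised without network access; the sketch does not handle normalized or missing titles.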
