How to scrape subcategories and pages in categories of a Wikipedia Category page using Python

Question

I'm trying to scrape all the subcategories and pages listed on the category page "Category:Class-based programming languages", found at:

https://en.wikipedia.org/wiki/Category:Class-based_programming_languages

I've figured out a way to do this using URLs and the categorymembers list of the MediaWiki API. The queries would be:

  • base: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500
  • subcategories only: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat
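The query strings above can also be assembled in Python instead of being typed by hand. The following is only a sketch: the helper name build_categorymembers_url and its arguments are hypothetical, and it relies on the standard library's urllib.parse.urlencode, which encodes spaces as "+" and the colon as "%3A" rather than using "%20".

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_categorymembers_url(category, member_type=None, limit=500):
    """Return a categorymembers query URL (hypothetical helper).

    member_type may be "subcat" or "page"; None leaves cmtype out,
    which corresponds to the "base" URL in the question.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "format": "json",
        "cmlimit": limit,
    }
    if member_type is not None:
        params["cmtype"] = member_type
    return API_ENDPOINT + "?" + urlencode(params)


base_url = build_categorymembers_url("Category:Class-based programming languages")
subcat_url = build_categorymembers_url(
    "Category:Class-based programming languages", "subcat"
)
```

Printing base_url or subcat_url and pasting the result into a browser is a quick way to check that the parameters are what you expect before writing any parsing code.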

However, I can't find a way to accomplish this using Python. Can anyone help me out here?

This is for independent study and I've spent a lot of time on this, but just can't seem to figure it out. Also, the use of Beautiful Soup is prohibited. Thank you for all the help!

Recommended answer

OK, so after doing more research and study, I was able to find an answer to my own question. Using the libraries urllib.request and json, I fetched the Wikipedia API response as JSON and simply printed the title of each category member. Here's the code I used to get the subcategories:

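import json
import urllib.request

# Query the MediaWiki API for the subcategories (cmtype=subcat) of the category.
url = "https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat"
pages = urllib.request.urlopen(url)
data = json.load(pages)
query = data['query']
category = query['categorymembers']
for x in category:
    print(x['title'])  # each member is a dict with a 'title' key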

You can do the same thing for the pages in the category by changing cmtype=subcat to cmtype=page. Thanks to Nemo for trying to help me out!
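As a rough sketch of how the same approach might cover both pages and subcategories, the snippet below wraps the request in a function and follows the API's "continue" object in case a category has more than 500 members. The function names and the User-Agent string are hypothetical and not part of the original answer; the User-Agent header is added on the assumption that Wikimedia may reject requests that do not identify themselves.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def extract_titles(data):
    """Pull the member titles out of a parsed categorymembers response."""
    return [member["title"] for member in data["query"]["categorymembers"]]


def fetch_members(category, member_type):
    """Fetch titles of one member type ("page" or "subcat").

    Hypothetical helper: repeats the request while the response
    contains a "continue" object, so large categories are fully listed.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": 500,
        "format": "json",
    }
    titles = []
    while True:
        request = urllib.request.Request(
            API_ENDPOINT + "?" + urlencode(params),
            # Placeholder User-Agent; replace with something identifying you.
            headers={"User-Agent": "category-scraper-example/0.1"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        titles.extend(extract_titles(data))
        if "continue" not in data:
            return titles
        # Carry cmcontinue (and the continue marker) into the next request.
        params.update(data["continue"])
```

With this in place, fetch_members("Category:Class-based programming languages", "page") should return the page titles and the same call with "subcat" should return the subcategories. Nothing here parses HTML, so Beautiful Soup is not needed.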
