Extract Wikipedia articles belonging to a category from offline dumps


Question

I have Wikipedia article dumps in different languages. I want to filter them down to the articles that belong to a given category (specifically Category:WikiProject_Biography).

I found a lot of similar questions, for example:

  1. Wikipedia API to get articles belonging to a category
  2. How do I get all articles about people from Wikipedia?

However, I would like to do it all offline, that is, using only the dumps, and for different languages.

Other things I have explored are the category table and the categorylinks table (see MediaWiki_1.28.0_database_schema).

Recommended answer

Fetch the page and categorylinks tables from the dump, then run

SELECT
    page_namespace,
    page_title
FROM
    page
    JOIN categorylinks ON page_id = cl_from
WHERE
    cl_to = 'WikiProject_Biography'
;

to get the list of pages.
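To try the join without importing a full dump, here is a minimal sketch that runs the same query against a toy in-memory SQLite database. The table layouts are cut down to the columns the query touches, the rows are invented sample data, and the helper names build_toy_db and pages_in_category are hypothetical. The ORDER BY is an addition so the output is deterministic. Real dumps ship these tables as MySQL-format SQL files, so in practice you would load them into MySQL or MariaDB and run the query there.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with simplified page/categorylinks tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id        INTEGER PRIMARY KEY,
            page_namespace INTEGER NOT NULL,
            page_title     TEXT NOT NULL
        );
        CREATE TABLE categorylinks (
            cl_from INTEGER NOT NULL,  -- page_id of the categorised page
            cl_to   TEXT NOT NULL      -- category title, without the "Category:" prefix
        );
        """
    )
    # Invented sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [(1, 0, "Ada_Lovelace"), (2, 0, "Alan_Turing"), (3, 0, "Paris")],
    )
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "WikiProject_Biography"),
            (2, "WikiProject_Biography"),
            (3, "Capitals_in_Europe"),
        ],
    )
    return conn


def pages_in_category(conn, category):
    """Return (page_namespace, page_title) rows for pages linked to `category`."""
    query = """
        SELECT page_namespace, page_title
        FROM page
        JOIN categorylinks ON page_id = cl_from
        WHERE cl_to = ?
        ORDER BY page_title
    """
    return conn.execute(query, (category,)).fetchall()


if __name__ == "__main__":
    conn = build_toy_db()
    for namespace, title in pages_in_category(conn, "WikiProject_Biography"):
        print(namespace, title)
```

Passing the category name as a bound parameter keeps the same function usable for each language edition, where the category title will usually differ.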
