Extract Wikipedia articles belonging to a category from offline dumps
Problem description
I have Wikipedia article dumps in several languages. I want to filter them down to the articles that belong to a given category (specifically Category:WikiProject_Biography).
I found a lot of similar questions, for example:
- Wikipedia API to get articles belonging to a category
- How do I get all articles about people from Wikipedia?
However, I would like to do it all offline, that is, using the dumps, and for several languages.
Other things I have explored are the category table and the categorylinks table (see MediaWiki_1.28.0_database_schema).
Recommended answer
Fetch the page and categorylinks tables from the dump, then run
SELECT
page_namespace,
page_title
FROM
page
JOIN categorylinks ON page_id = cl_from
WHERE
cl_to = 'WikiProject_Biography'
;
to get the list of pages.
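The query above can be tried out on a toy dataset before importing a full dump. The following is only a minimal sketch: it uses Python's built-in sqlite3 module with an in-memory database, whereas the real dump tables are distributed as SQL files intended for MySQL/MariaDB. The helper names (build_toy_database, pages_in_category) and the sample rows are made up for illustration, and the tables contain only the columns the query needs.

```python
import sqlite3


def build_toy_database():
    """Create an in-memory database with cut-down page and categorylinks tables.

    Assumption: only the columns used by the query are modelled here;
    the real MediaWiki tables have many more.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER NOT NULL,
            page_title TEXT NOT NULL
        );
        CREATE TABLE categorylinks (
            cl_from INTEGER NOT NULL,
            cl_to TEXT NOT NULL
        );
    """)
    # Hypothetical sample rows, not taken from any real dump.
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
        (1, 1, "Ada_Lovelace"),
        (2, 0, "Python_(programming_language)"),
        (3, 1, "Alan_Turing"),
    ])
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
        (1, "WikiProject_Biography"),
        (2, "Programming_languages"),
        (3, "WikiProject_Biography"),
    ])
    return conn


def pages_in_category(conn, category):
    """Return (page_namespace, page_title) for every page linked to `category`."""
    query = """
        SELECT page_namespace, page_title
        FROM page
        JOIN categorylinks ON page_id = cl_from
        WHERE cl_to = ?
        ORDER BY page_namespace, page_title
    """
    # The category name is passed as a bound parameter rather than
    # interpolated into the SQL string.
    return conn.execute(query, (category,)).fetchall()


if __name__ == "__main__":
    conn = build_toy_database()
    for namespace, title in pages_in_category(conn, "WikiProject_Biography"):
        print(namespace, title)
```

Because the category name is a parameter, the same function can be pointed at the localized category name when working with a dump in another language.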