Wikipedia Category Hierarchy from dumps

Problem description

Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.

For example, the SQL categories' dump has entries for each category but I can't find anything about how they relate to each other.

The other dump (latest-pages-articles) gives the parent categories of each page, but only as a flat, unordered list. It states all the parents without saying how they relate to one another.

I have seen Wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists category IDs, not names. Is there a way to get the name for each ID?

Recommended answer

The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.

You're also going to need the page (not pages-articles) dump to map page IDs to titles.
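
As a rough illustration of how the two tables fit together, here is a minimal Python sketch. It assumes the rows have already been parsed out of the SQL dumps into tuples, and that categorylinks exposes the columns cl_from (page ID of the member), cl_to (title of the parent category) and cl_type ('page', 'subcat' or 'file'), which is the long-standing layout; newer dumps may differ, so check the CREATE TABLE statement at the top of your file. The function name build_category_hierarchy and the sample rows are made up for the example.

```python
from collections import defaultdict

# MediaWiki namespace number for "Category:" pages.
CATEGORY_NAMESPACE = 14


def build_category_hierarchy(page_rows, categorylinks_rows):
    """Build a parent -> children map of category titles.

    page_rows: iterable of (page_id, page_namespace, page_title)
    categorylinks_rows: iterable of (cl_from, cl_to, cl_type)
    Both layouts are assumptions about the dump schema (see lead-in).
    """
    # Map the page ID of every category page to its title.
    id_to_title = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NAMESPACE
    }

    children = defaultdict(set)
    for cl_from, cl_to, cl_type in categorylinks_rows:
        # Only 'subcat' rows describe category -> category edges;
        # 'page' and 'file' rows attach articles and media.
        if cl_type != "subcat":
            continue
        child = id_to_title.get(cl_from)
        if child is not None:
            children[cl_to].add(child)
    return id_to_title, dict(children)


# Hypothetical sample rows, standing in for parsed dump contents.
pages = [
    (1, 14, "Science"),
    (2, 14, "Physics"),
    (3, 14, "Quantum_mechanics"),
    (4, 0, "Albert_Einstein"),
]
links = [
    (2, "Science", "subcat"),
    (3, "Physics", "subcat"),
    (4, "Physics", "page"),
]

id_to_title, hierarchy = build_category_hierarchy(pages, links)
```

Here id_to_title answers the "name for each ID" part of the question, and hierarchy maps each parent category title to the set of its direct subcategories. Titles in these tables use underscores instead of spaces, and the category graph can contain cycles, so any traversal over it should keep track of visited nodes.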
