How to build a Wikipedia category hierarchy?

Problem description

I'm trying to build a tree graph of Wikipedia articles and their categories. What do I need to do that?

From this site (http://dumps.wikimedia.org/enwiki/latest/), I've downloaded:

  • enwiki-latest-page.sql.gz
  • enwiki-latest-categorylinks.sql.gz
  • enwiki-20141106-category.sql.gz

I tried following the answer here (Wikipedia Category Hierarchy from dumps), but categorylinks doesn't seem to have the same schema (there is no pageId column).

What's the right way to build the hierarchy?

Bonus question: how can I tell which of the 35M pages in enwiki-latest-page.sql.gz are articles (supposedly about 5M, according to Wikipedia's statistics)?

Thanks

Solution

Yes, it turns out this Stack Overflow answer was right. It referenced the right datasets, but I was too dense to understand how to relate them to each other.

Thanks to @svick for leading me through the individual steps in a private chat.

For the benefit of others, I've explicitly detailed the relationship between the data sets and the exact steps to traverse the graph in my blog, which is a summary of our private chat.

Parsing Wikipedia Page Hierarchy
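
For anyone who wants a starting point, here is a rough sketch of how the two tables can be related. It is not taken from the blog post or the chat. It assumes the column names I understand the MediaWiki schema to have used around the time of these dumps (`page_id`, `page_namespace`, `page_title`, `page_is_redirect` in `page`; `cl_from`, `cl_to` in `categorylinks`), where `cl_from` holds the member's page id (which would explain the missing pageId column) and `cl_to` holds the parent category's title. Newer dumps may differ. It also assumes namespace 0 means articles and namespace 14 means categories, so filtering on namespace 0 and dropping redirects should roughly answer the bonus question. The toy rows and the helper names `build_hierarchy` and `descendants` are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Assumed MediaWiki namespace numbers: 0 = article, 14 = category.
ARTICLE_NS, CATEGORY_NS = 0, 14


def build_hierarchy(conn):
    """Return (subcats, articles), each mapping a category title to its members."""
    subcats = defaultdict(list)
    articles = defaultdict(list)
    # cl_from is the member's page id; cl_to is the parent category's title.
    rows = conn.execute(
        """
        SELECT cl.cl_to, p.page_namespace, p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE p.page_is_redirect = 0
        ORDER BY cl.cl_to, p.page_title
        """
    )
    for parent, namespace, title in rows:
        if namespace == CATEGORY_NS:
            subcats[parent].append(title)
        elif namespace == ARTICLE_NS:
            articles[parent].append(title)
    return dict(subcats), dict(articles)


def descendants(subcats, root):
    """Collect every category reachable from root; the seen set guards against cycles."""
    seen, stack = set(), [root]
    while stack:
        for child in subcats.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Toy data standing in for the imported dumps (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                       page_title TEXT, page_is_redirect INTEGER);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    INSERT INTO page VALUES
      (1, 14, 'Science', 0),
      (2, 14, 'Physics', 0),
      (3, 0, 'Quantum_mechanics', 0),
      (4, 0, 'QM', 1);
    INSERT INTO categorylinks VALUES
      (2, 'Science'), (3, 'Physics'), (4, 'Physics');
    """
)
subcats, articles = build_hierarchy(conn)
```

The category graph is not a strict tree (a category can have several parents, and cycles are possible), which is why the traversal keeps a `seen` set instead of recursing blindly.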
