Extracting page titles and contributors from MediaWiki XML


Question

I have a very large (7GB) MediaWiki XML dump that records every change made to every page of the wiki. I am trying to record which users have contributed to each page, so I want to extract that information from the XML.

The XML looks like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
 <page>
  <title>Unique Page title</title>
  <id>11</id>
  <restrictions>sysop</restrictions>
  <revision>
    <id>11</id>
    <timestamp>2005-10-26T02:23:03Z</timestamp>
    <contributor>
      <ip>MediaWiki default</ip>
    </contributor>
    <text xml:space="preserve">i</text>
  </revision>
 </page>
 <page> ... </page>
 <page> ... </page>
 ...
</mediawiki>

For a file this size, I believe I need to use iterparse. For now, I'm just trying to print out the title, but when I run the following code, it prints "None".

NS = '{http://www.mediawiki.org/xml/export-0.3/}'
from xml.etree.ElementTree import iterparse
with open('XMLFile.xml') as f:
    for event, elem in iterparse(f):
        if elem.tag == NS + 'page':
            for node in elem:
                if node.tag == NS + 'title':
                    print node.text()
        elem.clear()

Recommended answer

You get None when printing the text content of the title element because you are calling elem.clear() too early. By default, iterparse() only generates "end" events, and in the question's code elem.clear() runs for every one of them. The title element ends before its parent page does, so by the time the "end" event for page is emitted, all of its subelements, including title, have already been cleared (emptied).

If elem.clear() in the question's code is moved one indentation level (four spaces) to the right, so that it runs only for page elements, the code works as expected. Another way to make the code work is to change iterparse(f) to iterparse(f, events=["start"]). Note, however, that when a "start" event is emitted, the element's text and children are not guaranteed to have been parsed yet, so the first fix is the safer choice for a large file.

In addition, node.text() should be node.text: text is an attribute, not a method.
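Putting the two fixes together (elem.clear() moved into the page branch, and node.text read as an attribute), the question's loop could look roughly like the sketch below. The function name page_titles and the in-memory SAMPLE document are illustrative assumptions, not part of the original code; in practice you would pass a file opened from the real dump.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

def page_titles(source):
    """Yield the title text of every <page> in a MediaWiki XML export."""
    for event, elem in iterparse(source):  # default: "end" events only
        if elem.tag == NS + 'page':
            for node in elem:
                if node.tag == NS + 'title':
                    yield node.text  # attribute access, not a call
            # Clear only after the whole page has been processed.
            elem.clear()

# Tiny in-memory sample standing in for the real dump (hypothetical data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Unique Page title</title><id>11</id></page>
</mediawiki>"""

titles = list(page_titles(io.BytesIO(SAMPLE)))
print(titles)
```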

See http://effbot.org/zone/element-iterparse.htm for more details on iterparse().
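For a dump of several gigabytes, a further refinement that is often used with iterparse() may help: elem.clear() empties each page but leaves the empty element attached to the root, so the root's child list keeps growing. The sketch below, which is an addition and not part of the answer's code, grabs the root element from the first "start" event and clears it after each page. The function name and sample data are assumptions made for illustration.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

def iter_titles_low_memory(source):
    """Yield page titles while keeping the in-memory tree small."""
    context = iter(iterparse(source, events=("start", "end")))
    # The first event is the "start" of the root <mediawiki> element.
    event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == NS + "page":
            title = elem.find(NS + "title")
            yield title.text if title is not None else None
            # Drop every finished child from the root, not just its contents.
            root.clear()

# Hypothetical two-page sample used only to exercise the function.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>A</title><id>1</id></page>
  <page><title>B</title><id>2</id></page>
</mediawiki>"""

low_memory_titles = list(iter_titles_low_memory(io.BytesIO(SAMPLE)))
print(low_memory_titles)
```

Here the "start" events are requested only to get hold of the root element; the pages themselves are still processed on their "end" events, when their content is complete.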

Assume that the XML dump (mw.xml) looks like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
       <username>Alice</username>
      </contributor>
      <text xml:space="preserve">i</text>
    </revision>
  </page>

  <page>
    <title>Unique Page title 2</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
       <username>Bob</username>
      </contributor>
      <text xml:space="preserve">j</text>
    </revision>
  </page>
</mediawiki>

Here is a suggestion for how you can get the title and the contributor:

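from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Binary mode lets the XML parser handle the encoding itself.
with open('mw.xml', 'rb') as f:
    for event, elem in iterparse(f):  # "end" events only, by default
        if elem.tag == '{0}page'.format(NS):
            title = elem.find("{0}title".format(NS))
            # First <username> found anywhere below this page
            contr = elem.find(".//{0}username".format(NS))

            if title is not None:
                print(title.text)
            if contr is not None:
                print(contr.text)

            # The page is fully processed, so it is safe to free it now.
            elem.clear()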

Output:

Unique Page title 1 
Alice
Unique Page title 2 
Bob

I'm assuming that you want the username of the contributor. According to the latest XML Schema, contributor can contain username, ip, and/or id child elements (this is also true for the 0.3 version of the schema).
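Since a page can have many revisions and a contributor may be identified by ip instead of username, the original goal (which users contributed to each page) might be approached as in the sketch below. It goes beyond the answer's code: the function name contributors_by_page, the fallback from username to ip, and the sample data are all assumptions made for illustration. It relies on title appearing before the revisions inside each page, as in the sample XML above.

```python
import io
from collections import defaultdict
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

def contributors_by_page(source):
    """Map each page title to the set of contributors seen in its revisions."""
    pages = defaultdict(set)
    title = None
    for event, elem in iterparse(source):  # "end" events only
        if elem.tag == NS + 'title':
            title = elem.text  # remember the title of the current page
        elif elem.tag == NS + 'revision':
            contr = elem.find(NS + 'contributor')
            if contr is not None:
                # Prefer <username>; fall back to <ip> for anonymous edits.
                name = (contr.findtext(NS + 'username')
                        or contr.findtext(NS + 'ip'))
                if name:
                    pages[title].add(name)
            elem.clear()  # free the revision text as soon as it is handled
        elif elem.tag == NS + 'page':
            elem.clear()
    return dict(pages)

# Hypothetical sample: two pages, one with an anonymous (ip) edit.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>P1</title>
    <revision><id>1</id><contributor><username>Alice</username></contributor></revision>
    <revision><id>2</id><contributor><ip>127.0.0.1</ip></contributor></revision>
  </page>
  <page>
    <title>P2</title>
    <revision><id>3</id><contributor><username>Bob</username></contributor></revision>
    <revision><id>4</id><contributor><username>Bob</username></contributor></revision>
  </page>
</mediawiki>"""

result = contributors_by_page(io.BytesIO(SAMPLE))
for page_title in sorted(result):
    print(page_title, sorted(result[page_title]))
```

Clearing each revision as soon as it ends, rather than waiting for the end of the page, keeps the revision texts of a heavily edited page from accumulating in memory.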
