How can I parse a Wikipedia XML dump with Python?
Question

I have:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    # Strip the namespace prefix, e.g. "{http://...}title" -> "title"
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None
for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'page':
            print(title, elem.text)
This seems to give the title just fine, but the page text
always seems blank. What am I missing?
I haven't been able to open the file (it's huge), but I think this is an accurate snippet:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>766348469</id>
      <parentid>766047928</parentid>
      <timestamp>2017-02-19T18:08:07Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor />
      <comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
      ...
      </text>
    </revision>
  </page>
</mediawiki>
Answer
The text refers to the text between the element tags (i.e. <tag>text</tag>) and not to all the child elements. Thus, in the case of the title element one has:
<title>AccessibleComputing</title>
and the text between the tags is AccessibleComputing.
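As a quick check, the behaviour of .text on a leaf element can be reproduced with the standard library (a minimal sketch, independent of the dump file):

```python
import xml.etree.ElementTree as ET

# .text holds only the character data directly between an element's own tags
title = ET.fromstring("<title>AccessibleComputing</title>")
print(title.text)  # AccessibleComputing
```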
In the case of the page element, the only text defined is '\n ' and there are other child elements (see below), including the title element:
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
...
</page>
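The same can be verified on a container element: its .text is only the whitespace before its first child, while the actual content lives in the children. A small sketch mirroring the snippet above:

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<page>\n"
    "  <title>Anarchism</title>\n"
    "  <ns>0</ns>\n"
    "  <id>12</id>\n"
    "</page>"
)
print(repr(page.text))                # '\n  ' -- only whitespace
print([child.tag for child in page])  # ['title', 'ns', 'id']
```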
If you want to parse the file, I would recommend using either the findall method:
from lxml import etree
from lxml.etree import tostring

tree = etree.parse('data/enwiki-20190620-pages-articles-multistream.xml')
root = tree.getroot()

# iterate through all the titles
for title in root.findall(".//title", namespaces=root.nsmap):
    print(tostring(title))
    print(title.text)
which produces the following output:
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">AccessibleComputing</title>\n '
AccessibleComputing
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Anarchism</title>\n '
Anarchism
or the xpath method:
nsmap = root.nsmap
nsmap['x'] = nsmap.pop(None)  # xpath does not accept the default (None) prefix

# iterate through all the pages
for page in root.xpath(".//x:page", namespaces=nsmap):
    print(page)
    print(repr(page.text))  # which prints '\n  '
    print('number of children: %i' % len(page))
and the output is:
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc610c8>
'\n '
number of children: 5
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc71bc8>
'\n '
number of children: 5
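If memory is a concern with the full dump, the original iterparse approach can also be made to work: read the wikitext from the nested text element rather than from page, and clear each page once it is finished. A hedged sketch (the namespace URI is taken from the snippet above; iter_pages is a made-up helper name):

```python
import xml.etree.ElementTree as ET

# namespace URI from the dump's <mediawiki> header (assumed export-0.10)
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Yield (title, wikitext) pairs while streaming through the dump."""
    title = None
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "text":
            yield title, elem.text
        elif elem.tag == NS + "page":
            elem.clear()  # drop the finished page so memory stays bounded

# usage: for title, text in iter_pages('data/enwiki-20190620-pages-articles-multistream.xml'): ...
```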
See the lxml tutorial at https://lxml.de/2.2/tutorial.html for more details.