从ElementTree findall返回的空列表 [英] Empty list returned from ElementTree findall

查看：101 发布时间：2020/5/25 0:11:25 python xml parsing elementtree wikimedia-dumps

本文介绍了从ElementTree findall返回的空列表的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我是xml解析和Python的新手，请多多包涵.我正在使用lxml来解析Wiki转储，但是我只想要每个页面的标题和文本.

I'm new to xml parsing and Python so bear with me. I'm using lxml to parse a wiki dump, but I just want for each page, its title and text.

现在我已经知道了:

from xml.etree import ElementTree as etree

def parser(file_name):
    document = etree.parse(file_name)
    titles = document.findall('.//title')
    print titles

目前标题未返回任何内容.我已经看过以前这样的答案: ElementTree findall()返回空列表和lxml文档，但是大多数事情似乎都是为解析HTML而量身定制的.

At the moment titles isn't returning anything. I've looked at previous answers like this one: ElementTree findall() returning empty list and the lxml documentation, but most things seemed to be tailored towards parsing HTML.

这是我的XML的一部分:

This is a section of my XML:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/"     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
  <siteinfo>
  <sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.20wmf9</generator>
<case>first-letter</case>
<namespaces>
  <namespace key="-2" case="first-letter">Media</namespace>
  <namespace key="-1" case="first-letter">Special</namespace>
  <namespace key="0" case="first-letter" />
  <namespace key="1" case="first-letter">Talk</namespace>
  <namespace key="2" case="first-letter">User</namespace>
  <namespace key="3" case="first-letter">User talk</namespace>
  <namespace key="4" case="first-letter">Wikipedia</namespace>
  <namespace key="5" case="first-letter">Wikipedia talk</namespace>
  <namespace key="6" case="first-letter">File</namespace>
  <namespace key="7" case="first-letter">File talk</namespace>
  <namespace key="8" case="first-letter">MediaWiki</namespace>
  <namespace key="9" case="first-letter">MediaWiki talk</namespace>
  <namespace key="10" case="first-letter">Template</namespace>
  <namespace key="11" case="first-letter">Template talk</namespace>
  <namespace key="12" case="first-letter">Help</namespace>
  <namespace key="13" case="first-letter">Help talk</namespace>
  <namespace key="14" case="first-letter">Category</namespace>
  <namespace key="15" case="first-letter">Category talk</namespace>
  <namespace key="100" case="first-letter">Portal</namespace>
  <namespace key="101" case="first-letter">Portal talk</namespace>
  <namespace key="108" case="first-letter">Book</namespace>
  <namespace key="109" case="first-letter">Book talk</namespace>
</namespaces>
  </siteinfo>
  <page>
    <title>Aratrum</title>
    <ns>0</ns>
    <id>65741</id>
    <revision>
  <id>349931990</id>
  <parentid>225434394</parentid>
  <timestamp>2010-03-15T02:55:02Z</timestamp>
  <contributor>
    <ip>143.105.193.119</ip>
  </contributor>
  <comment>/* Sources */</comment>
  <sha1>2zkdnl9nsd1fbopv0fpwu2j5gdf0haw</sha1>
  <text xml:space="preserve" bytes="1436">'''Aratrum''' is the Latin word for  [[plough]], and &quot;arotron&quot; (αροτρον) is the [[Greek language|Greek]] word. The   [[Ancient Greece|Greeks]] appear to have had diverse kinds of plough from the earliest  historical records. [[Hesiod]] advised the farmer to have always two ploughs, so that if  one broke the other might be ready for use. These ploughs should be of two kinds, the one  called &quot;autoguos&quot; (αυτογυος, &quot;self-limbed&quot;), in which the plough-tail  was of the same piece of timber as the share-beam and the pole; and the other called  &quot;pekton&quot; (πηκτον, &quot;fixed&quot;), because in it, three parts, which were of  three kinds of timber, were adjusted to one another, and fastened together by nails.

The ''autoguos'' plough was made from a [[sapling]] with two branches growing from its   trunk in opposite directions. In ploughing, the trunk served as the pole, one of the two     branches stood upwards and became the tail, and the other penetrated the ground and,    sometimes shod with bronze or iron, acted as the [[ploughshare]]. 

==Sources==
Based on an article from ''A Dictionary of Greek and Roman Antiquities,'' John Murray,     London, 1875.
ἄρατρον

==External links==
*[http://penelope.uchicago.edu/Thayer/E/Roman/Texts/secondary/SMIGRA*/Aratrum.html Smith's     Dictionary article], with diagrams, further details, sources.
[[Category:Agricultural machinery]]
[[Category:Ancient Greece]]
[[Category:Animal equipment]]</text>
</revision>
</page>

我也尝试过iterparse，然后打印找到的元素的标签:

I've also tried iterparse and then printing the tag of the element it finds:

for e in etree.iterparse(file_name):
    print e.tag

但是它抱怨e没有标签属性.

but it complains about the e not having a tag attribute.

推荐答案

问题是您没有考虑XML名称空间. XML文档(及其中的所有元素)位于http://www.mediawiki.org/xml/export-0.7/命名空间中.要使其正常工作，您需要进行更改

The problem is that you are not taking XML namespaces into account. The XML document (and all the elements in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change

titles = document.findall('.//title')

到

titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')

还可以通过namespaces参数提供名称空间:

The namespace can also be provided via the namespaces parameter:

NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)

这在Python 2.7中有效，但

This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).

另请参见 http://effbot.org/zone/element-namespaces.htm以及带有答案的此类SO问题:通过"ElementTree"在Python中使用命名空间解析XML .

See also http://effbot.org/zone/element-namespaces.htm and this SO question with answer: Parsing XML with namespace in Python via 'ElementTree'.

iterparse() 是由于该函数提供了(event, element)元组(而不仅仅是元素)而引起的.为了获得标签名称，请更改

The trouble with iterparse() is caused by the fact that this function provides (event, element) tuples (not just elements). In order to get the tag name, change

for e in etree.iterparse(file_name):
    print e.tag

对此:

for e in etree.iterparse(file_name):
    print e[1].tag

这篇关于从ElementTree findall返回的空列表的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

从ElementTree findall返回的空列表 [英] Empty list returned from ElementTree findall

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

从ElementTree findall返回的空列表 [英] Empty list returned from ElementTree findall

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭