The ElementTree XML
Question
I can't figure out why I get an error when trying to reach the timestamp. The XML format (some attributes left out):
Edit: this is the actual structure of the XML file.
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.27.0-wmf.18</generator>
    <case>first-letter</case>
    <namespaces>...</namespaces>
  </siteinfo>
  <page>
    <title>Zhuangzi</title>
    <ns>0</ns>
    <id>42870472</id>
    <revision>
      <id>610251969</id>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor>
        <username>White whirlwind</username>
        <id>8761551</id>
      </contributor>
      <comment>...</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
      <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
    </revision>
    <revision>...</revision>
    <revision>...</revision>
    <revision>...</revision>
    <revision>...</revision>
    <revision>...</revision>
  </page>
  <page>...</page>
</mediawiki>
But when I try the following:
for page in root:
    for revision in page:
        print(revision.find('timestamp').text)
I get this error:
print(revision.find('timestamp').text)
AttributeError: 'NoneType' object has no attribute 'text'
Recommended answer
You are iterating over every child tag, so calling .find on each one returns None for any tag that has no timestamp child, hence your error:
In [9]: for page in root:
   ...:     print(page.tag)
   ...:     for revision in page:
   ...:         print(revision.tag)
   ...:
id
timestamp
contributor
comment
model
Using your own method, you would have to check each tag:
from xml.etree.ElementTree import fromstring

# xml is assumed to hold the <page> document as a string
xml = fromstring(xml)
for page in xml:
    for revision in page:
        if revision.tag == "timestamp":
            print(revision.text)
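As a possible shortcut for the tag check above, Element.iter accepts a tag name and does the filtering itself. The sketch below assumes a trimmed copy of the sample page without a namespace; the PAGE constant is only illustrative and is not part of the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page, without a namespace (illustrative only)
PAGE = """<page>
  <title>Zhuangzi</title>
  <revision>
    <id>610251969</id>
    <timestamp>2014-05-26T20:08:14Z</timestamp>
  </revision>
</page>"""

page = ET.fromstring(PAGE)

# iter("timestamp") walks all descendants and yields only those whose
# tag is exactly "timestamp", so no manual tag comparison is needed
timestamps = [elem.text for elem in page.iter("timestamp")]
print(timestamps)
```

Note that this matches on the literal tag name, so on a namespaced document the bare name "timestamp" would match nothing.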
You can use findall to get all the revision tags and then extract the timestamps:
In [1]: xml = """<page>
...: <title>Zhuangzi</title>
...: <ns>0</ns>
...: <id>42870472</id>
...: <revision>
...: <id>610251969</id>
...: <timestamp>2014-05-26T20:08:14Z</timestamp>
...: <contributor>
...: <username>White whirlwind</username>
...: <id>8761551</id>
...: </contributor>
...: <comment>TEXT</comment>
...: <model>wikitext</model>
...: </revision>
...: </page>"""
In [2]: import xml.etree.ElementTree as ET
In [3]: from io import StringIO
In [4]: tree = ET.parse(StringIO(xml))
In [5]: root = tree.getroot()
In [6]: print([r.find("timestamp").text for r in root.findall("revision")])
['2014-05-26T20:08:14Z']
If you used lxml, you could use a simple xpath expression:
from lxml.etree import parse,fromstring
xml = """<page>
<title>Zhuangzi</title>
<ns>0</ns>
<id>42870472</id>
<revision>
<id>610251969</id>
<timestamp>2014-05-26T20:08:14Z</timestamp>
<contributor>
<username>White whirlwind</username>
<id>8761551</id>
</contributor>
<comment>TEXT</comment>
<model>wikitext</model>
</revision>
</page>"""
root = fromstring(xml)
print(root.xpath("//revision/timestamp/text()"))
['2014-05-26T20:08:14Z']
With what you have posted, you need to use a namespace mapping:
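import xml.etree.ElementTree as ET

tree = ET.parse("your_xml")
root = tree.getroot()
# Map a prefix of our choosing to the document's default namespace
ns = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}
timestamps = [ts.text for ts in root.findall(".//wiki:revision//wiki:timestamp", ns)]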
Presuming all the revision tags have a timestamp tag.
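If that presumption might not hold and you want one result per revision, a None check on find avoids the original AttributeError. This is a minimal sketch under assumptions: SAMPLE is a cut-down document in the same namespace, its second revision (without a timestamp) is made up for illustration, and revision_timestamps is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cut-down sample in the posted namespace; the second revision is
# invented here to show the missing-timestamp case
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Zhuangzi</title>
    <revision>
      <id>610251969</id>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
    </revision>
    <revision>
      <id>1</id>
    </revision>
  </page>
</mediawiki>"""

NS = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}


def revision_timestamps(root):
    """Return one entry per revision: the timestamp text, or None if absent."""
    results = []
    for revision in root.iterfind(".//wiki:revision", NS):
        ts = revision.find("wiki:timestamp", NS)
        # find returns None when there is no matching child element
        results.append(ts.text if ts is not None else None)
    return results


root = ET.fromstring(SAMPLE)
print(revision_timestamps(root))
```

Keeping a None placeholder preserves the one-to-one pairing between revisions and results; filtering the None values out afterwards is an equally reasonable choice.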
Or using lxml with an xpath:
from lxml.etree import parse

tree = parse("your_file")
ns = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}
print(tree.xpath("//wiki:revision//wiki:timestamp//text()", namespaces=ns))
If you print the tag of every element:
tree = parse("test.xml")
for elem in tree.iter():
    print(elem.tag)
the output is:
{http://www.mediawiki.org/xml/export-0.10/}mediawiki
{http://www.mediawiki.org/xml/export-0.10/}siteinfo
{http://www.mediawiki.org/xml/export-0.10/}sitename
{http://www.mediawiki.org/xml/export-0.10/}dbname
{http://www.mediawiki.org/xml/export-0.10/}base
{http://www.mediawiki.org/xml/export-0.10/}generator
{http://www.mediawiki.org/xml/export-0.10/}case
{http://www.mediawiki.org/xml/export-0.10/}namespaces
{http://www.mediawiki.org/xml/export-0.10/}page
.............................
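The qualified tags in that output explain why a bare name such as timestamp never matches. As a rough illustration using only the standard library, the sketch below builds a tiny document in the same namespace (the inline document and the variable names are assumptions for the example) and compares three lookups. The {*} wildcard form needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

URI = "http://www.mediawiki.org/xml/export-0.10/"

# Tiny stand-in for the real dump, declared in the same default namespace
doc = ET.fromstring(
    f'<mediawiki xmlns="{URI}"><page><revision>'
    "<timestamp>2014-05-26T20:08:14Z</timestamp>"
    "</revision></page></mediawiki>"
)

# A bare name does not match: the element's tag is "{URI}timestamp"
bare = doc.find(".//timestamp")

# Spelling out the namespace in braces matches
qualified = doc.find(f".//{{{URI}}}timestamp")

# Python 3.8+ also accepts a wildcard for the namespace part
wildcard = doc.find(".//{*}timestamp")

print(bare, qualified.text, wildcard.text)
```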