Get text from mixed element xml tags with ElementTree
Problem description
I'm using ElementTree to parse an XML document. I am getting the text from the <u> tags. Some of them have mixed content that I need to either filter out or keep as text. Two examples that I have are:
<u>
<vocal type="filler">
<desc>eh</desc>
</vocal>¿Sí?
</u>
<u>Pues...
<vocal type="non-ling">
<desc>laugh</desc>
</vocal>A mí no me suena.
</u>
I want to get the text within the <vocal> tag if its type is filler, but not if its type is non-ling.
If I iterate through the children of <u>, the last bit of text is always lost. The only way I can reach it is by using itertext(), but then I lose the chance to check the type of the <vocal> tag.
How can I parse it so that I get a result like this:
eh ¿Sí?
Pues... A mí no me suena.
Recommended answer
The lost text bits, "¿Sí?" and "A mí no me suena.", are available as the tail attribute of each <vocal> element (the text following the element's end tag).
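As a small illustration of where ElementTree stores each piece of mixed content, the sketch below parses one of the <u> elements from the question (inlined as a string, written for Python 3) and reads the text and tail attributes. The variable names are chosen for this example only.

```python
from xml.etree import ElementTree as ET

# One <u> element from the question, inlined without extra whitespace.
u = ET.fromstring(
    '<u>Pues...<vocal type="non-ling"><desc>laugh</desc></vocal>'
    'A mí no me suena.</u>'
)
vocal = u.find("vocal")

leading = u.text       # text before the first child of <u>
trailing = vocal.tail  # text after </vocal>, stored on the child element

print(repr(leading))
print(repr(trailing))
```

Iterating over the children of <u> and reading only their text misses the trailing string, because it belongs to the tail of <vocal> rather than to <u> itself.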
Here is a way to get the wanted output (tested with Python 2.7).
Assume that vocal.xml looks like this:
<root>
<u>
<vocal type="filler">
<desc>eh</desc>
</vocal>¿Sí?
</u>
<u>Pues...
<vocal type="non-ling">
<desc>laugh</desc>
</vocal>A mí no me suena.
</u>
</root>
Code:
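from xml.etree import ElementTree as ET

root = ET.parse("vocal.xml")

for u in root.findall(".//u"):
    v = u.find("vocal")
    if v.get("type") == "filler":
        # Keep the <desc> text of filler vocals.
        frags = [u.text, v.findtext("desc"), v.tail]
    else:
        # Skip the <desc> text of every other type.
        frags = [u.text, v.tail]
    # Python 2.7 print statement; v.tail is the text after </vocal>.
    print " ".join(t.encode("utf-8").strip() for t in frags).strip()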
Output:
eh ¿Sí?
Pues... A mí no me suena.
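The code above uses the Python 2.7 print statement and assumes exactly one <vocal> child and non-empty text in every <u>. A minimal Python 3 sketch of the same idea follows. It is not part of the original answer: the helper name utterance_text is hypothetical, and the XML is inlined as a string instead of being read from vocal.xml so that the example is self-contained.

```python
from xml.etree import ElementTree as ET

# Same content as vocal.xml, inlined for a self-contained example.
XML = """<root>
<u>
<vocal type="filler">
<desc>eh</desc>
</vocal>¿Sí?
</u>
<u>Pues...
<vocal type="non-ling">
<desc>laugh</desc>
</vocal>A mí no me suena.
</u>
</root>"""


def utterance_text(u):
    """Collect the text of one <u>, keeping <desc> only for filler vocals."""
    frags = [u.text]
    for child in u:
        if child.tag == "vocal" and child.get("type") == "filler":
            frags.append(child.findtext("desc"))
        frags.append(child.tail)  # text after the child's end tag
    # Drop None and whitespace-only fragments, then join with single spaces.
    return " ".join(f.strip() for f in frags if f and f.strip())


root = ET.fromstring(XML)
lines = [utterance_text(u) for u in root.iter("u")]
for line in lines:
    print(line)
```

Looping over every child rather than calling find("vocal") once lets the sketch cope with a <u> that has several <vocal> elements or none, and the None check avoids an error when an element has no text or tail.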