Using Python's xml.etree to find element start and end character offsets

Problem description

My XML data looks like this:

<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>

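I would like to be able to extract:


  1. The XML elements, as they are currently provided in etree.

  2. The plain text of the document, between the start and end tags.

  3. The position of each start element within the plain text, as a character offset.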

(3) is the most important requirement right now; etree already provides (1).

I cannot see any way to do (3) directly, but hoped that iterating through the elements in the document tree would return many small strings that could be reassembled, thus providing (2) and (3). However, requesting the .text of the root node only returns the text between the root's start tag and the first child element, e.g. "The capital of ".

Doing (1) with SAX could involve re-implementing a lot that has already been written many times over, e.g. in minidom and etree. Using lxml isn't an option for the package that this code is to go into. Can anybody help?
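For reference, the behaviour described in the question can be reproduced with a few lines. This is only an illustrative sketch using the sample document: in ElementTree, the text that follows a child's end tag is stored on that child's .tail rather than on the parent's .text, and itertext() walks all of it in document order.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<xml>\nThe captial of <place pid="1">South Africa</place>'
    ' is <place>Pretoria</place>.\n</xml>'
)

# Text between the root's start tag and its first child only.
print(repr(root.text))

# Each child carries its own text plus the text that follows its end tag.
for child in root:
    print(child.tag, repr(child.text), repr(child.tail))

# All character data in document order, i.e. requirement (2).
plain_text = "".join(root.itertext())
print(repr(plain_text))
```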

Recommended answer

You could use the iterparse() function that is available in xml.etree:

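import io
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

# iterparse() takes a filename or a file object; BytesIO stands in for a real file here
source = io.BytesIO(b'<xml>\nThe captial of <place pid="1">South Africa</place>'
                    b' is <place>Pretoria</place>.\n</xml>')

for event, elem in etree.iterparse(source, events=('start', 'end')):
    if event == 'start':
        print(elem.tag)  # use only the tag name and attributes here
    elif event == 'end':
        # elem's child elements, elem.text and elem.tail are available
        if elem.text is not None and elem.tail is not None:
            print(repr(elem.tail))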
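Once the tree has been parsed completely, .text and .tail together cover all character data, so a recursive walk can rebuild the plain text and record offsets along the way. The following is a minimal sketch under that assumption; collect_offsets is a hypothetical helper name, not part of xml.etree.

```python
import xml.etree.ElementTree as ET


def collect_offsets(root):
    """Return (plain_text, spans); spans holds (element, start, end)
    character offsets into plain_text. Hypothetical helper."""
    parts = []   # text fragments in document order
    spans = []   # (element, start, end), appended as each element closes
    pos = 0      # number of characters collected so far

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # The tail follows the child's end tag, so it counts
            # towards the parent's span, not the child's.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        spans.append((elem, start, pos))

    walk(root)
    return "".join(parts), spans


xml_text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

plain, spans = collect_offsets(ET.fromstring(xml_text))
for elem, start, end in spans:
    print(elem.tag, start, end, repr(plain[start:end]))
```

The offsets index into the reassembled plain text, not into the raw XML source, and the spans come out in the order the elements close (children before their parent).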

Another option is to override the start(), data() and end() methods of etree.TreeBuilder():

from xml.etree.ElementTree import XMLParser, TreeBuilder

class MyTreeBuilder(TreeBuilder):

    def start(self, tag, attrs):
        print("<%s>" % tag)
        return TreeBuilder.start(self, tag, attrs)

    def data(self, data):
        print(repr(data))
        TreeBuilder.data(self, data)

    def end(self, tag):
        return TreeBuilder.end(self, tag)

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

# same steps as ElementTree.fromstring(), but with a custom target
parser = XMLParser(target=MyTreeBuilder())
parser.feed(text)
root = parser.close() # return an ordinary Element



Output

<xml>
'\nThe captial of '
<place>
'South Africa'
' is '
<place>
'Pretoria'
'.\n'
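The same hooks can be used to record offsets while the tree is being built: counting the characters passed to data() gives the current position in the plain text whenever start() or end() fires. This is a sketch only; OffsetTreeBuilder and its attributes (text_parts, offset, open_starts, spans) are hypothetical names introduced for illustration.

```python
from xml.etree.ElementTree import XMLParser, TreeBuilder


class OffsetTreeBuilder(TreeBuilder):
    """TreeBuilder that also records character offsets (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []    # character data chunks in document order
        self.offset = 0         # characters of plain text seen so far
        self.open_starts = []   # start offsets of the currently open elements
        self.spans = []         # finished (tag, start, end) triples

    def start(self, tag, attrs):
        self.open_starts.append(self.offset)
        return super().start(tag, attrs)

    def data(self, data):
        # The parser may deliver text in several chunks; only the
        # running total matters for the offsets.
        self.text_parts.append(data)
        self.offset += len(data)
        return super().data(data)

    def end(self, tag):
        start = self.open_starts.pop()
        self.spans.append((tag, start, self.offset))
        return super().end(tag)


xml_text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

builder = OffsetTreeBuilder()
parser = XMLParser(target=builder)
parser.feed(xml_text)
root = parser.close()   # an ordinary Element, as before
plain = "".join(builder.text_parts)

for tag, start, end in builder.spans:
    print(tag, start, end, repr(plain[start:end]))
```

Because the offsets are counted on the text handed to data(), they refer to the parsed plain text (after entity expansion and newline normalization) rather than to byte positions in the source file.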
