Python lxml - get index of tag's text
Problem description
I have an xml-file with a format similar to docx, i.e.:
<w:r>
    <w:rPr>
        <w:sz w:val="36"/>
        <w:szCs w:val="36"/>
    </w:rPr>
    <w:t>BIG_TEXT</w:t>
</w:r>
I need to get an index of "BIG_TEXT" in source xml, like:
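from lxml import etree

# Read raw bytes: etree.XML() rejects str input that carries an encoding declaration.
with open('/devel/tmp/doc2/word/document.xml', 'rb') as f:
    data = f.read()
root = etree.XML(data)
text = data.decode('utf-8')  # assumption: the file is UTF-8 encoded
start = 0
for e in root.iter("*"):
    if e.text:
        # Search forward from the end of the previous match.
        offset = text.index(e.text, start)
        l = len(e.text)
        print('Text "%s" at offset %s and len=%s' % (e.text, offset, l))
        start = offset + l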
I can start each new search from the position of the current index + len(text), but is there another way? An element's text may be a single character, w for example. The search will then find the index of the first w it meets in the source, which is not necessarily the index of the tag text w.
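The pitfall can be reproduced with a tiny in-memory string. This is only an illustrative sketch: the `sample` document is made up for the demonstration and is not the asker's file.

```python
# Hypothetical one-line document whose tag text is the single character "w".
sample = '<w:r><w:t>w</w:t></w:r>'

# Naive search from the start: matches the "w" of the tag name in "<w:r>".
naive_offset = sample.index('w', 0)

# Where the tag text really is: right after the closing ">" of "<w:t>".
real_offset = sample.index('<w:t>') + len('<w:t>')

print(naive_offset, real_offset)
```

The two offsets differ, so a plain substring search cannot tell markup from text; the offset has to come from the parser.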
Recommended answer
I was looking for a similar solution (indexing nodes in a big xml file for fast lookup).
- AFAIK, lxml only offers sourceline, which is insufficient. Cf. the API: "Original line number as found by the parser or None if unknown."
- But expat provides the exact offset in the file: CurrentByteIndex.
  - Fetched from the start_element handler, it returns the tag's start (i.e. '<') offset.
  - Fetched from the char_data handler, it returns the data's start (i.e. 'B' in your example) offset.
Example:
import xml.parsers.expat


# Handler functions for parser events, plus housekeeping.
class handler:
    def __init__(self, current_parser):
        # Tag of interest
        self.TARGET_TAG = "w:t"
        # Set up the parser
        self.parser = current_parser
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data
        self.target_tag_met = False
        self.index = None

    def start_element(self, name, attrs):
        self.target_tag_met = (name == self.TARGET_TAG)

    def end_element(self, name):
        self.target_tag_met = False

    def char_data(self, data):
        if self.target_tag_met:
            self.index = self.parser.CurrentByteIndex


# Open the file in binary mode for more robust byte offsets.
with open("so_test.xml", 'rb') as xmlFile:
    p = xml.parsers.expat.ParserCreate()
    h = handler(p)
    p.ParseFile(xmlFile)
print(h.index)
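The example above stores a single offset: the one from the last matching character-data event. A possible variation, sketched below, collects the offset and text of every target element. The `text_offsets` helper and the `sample` document are hypothetical names introduced for illustration. The sketch assumes the target elements contain only text (no child elements), and it allows for expat delivering one run of text in several chunks.

```python
import xml.parsers.expat


def text_offsets(xml_bytes, target_tag="w:t"):
    """Return a list of (byte_offset, text) for each target_tag element."""
    parser = xml.parsers.expat.ParserCreate()
    results = []
    state = {"inside": False, "start": None, "chunks": []}

    def start_element(name, attrs):
        if name == target_tag:
            state["inside"] = True
            state["start"] = None
            state["chunks"] = []

    def char_data(data):
        if state["inside"]:
            # Remember the offset of the first chunk only; expat may
            # call this handler several times for one run of text.
            if state["start"] is None:
                state["start"] = parser.CurrentByteIndex
            state["chunks"].append(data)

    def end_element(name):
        if name == target_tag and state["inside"]:
            if state["start"] is not None:
                results.append((state["start"], "".join(state["chunks"])))
            state["inside"] = False

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_bytes, True)  # True marks the final chunk of input
    return results


# Made-up sample; the namespace URI is a placeholder.
sample = b'<w:r xmlns:w="urn:example"><w:t>w</w:t><w:t>BIG_TEXT</w:t></w:r>'
offsets = text_offsets(sample)
for offset, text in offsets:
    print(offset, text)
```

The offsets are byte positions, so they should be used to slice the raw bytes of the file rather than a decoded string; the two only coincide when the content before the offset is pure ASCII.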