Python lxml - get index of tag's text

Views: 387 Published: 2020/5/4 8:29:30 Tags: python xml lxml


Question

I have an XML file with a format similar to docx, i.e.:

<w:r>
  <w:rPr>
    <w:sz w:val="36"/>
    <w:szCs w:val="36"/>
  </w:rPr>
  <w:t>BIG_TEXT</w:t>
</w:r>

I need to get the index of "BIG_TEXT" in the source XML, for example:

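from lxml import etree

# Read as bytes: lxml rejects str input that carries an encoding declaration.
with open('/devel/tmp/doc2/word/document.xml', 'rb') as f:
    text = f.read()

root = etree.XML(text)

start = 0
for e in root.iter("*"):
    if e.text:
        needle = e.text.encode('utf-8')  # assumes the file is UTF-8 encoded
        offset = text.index(needle, start)
        length = len(needle)
        print('Text "%s" at offset %s and len=%s' % (e.text, offset, length))
        start = offset + length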

I can start each new search from the current index + len(text), but is there another way? An element's text may be a single character, w for example. The search will then find the index of the first w it meets (such as the one in a tag name), not the index of the tag text w.
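The pitfall described above can be reproduced in a few lines. This is only an illustration on a made-up one-line sample (the `xml` bytes below are hypothetical, not taken from the question's file):

```python
# Hypothetical sample: an element whose text is the single character 'w'.
xml = b'<w:r><w:t>w</w:t></w:r>'

naive = xml.index(b'w')          # first 'w' anywhere in the source
actual = xml.index(b'>w<') + 1   # the 'w' between <w:t> and </w:t>

print(naive, actual)  # 1 10
```

The naive search lands on the `w` of the `<w:r>` tag name at offset 1, while the element text actually sits at offset 10.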

Recommended answer

I was looking for a similar solution (indexing nodes in a big XML file for fast lookup).

AFAIK, lxml only offers sourceline, which is insufficient. Cf. the API: "Original line number as found by the parser or None if unknown."

But expat provides the exact byte offset in the file: CurrentByteIndex.

Fetched from the start_element handler, it returns the offset of the tag's start (i.e. '<').
Fetched from the char_data handler, it returns the offset of the data's start (i.e. 'B' in your example).

Example:

import xml.parsers.expat

# Handler functions for parser events, plus housekeeping.
class handler :
   def __init__(self, current_parser) :
      # tag of interest
      self.TARGET_TAG = "w:t"

      # set up parser
      self.parser = current_parser
      self.parser.StartElementHandler  = self.start_element
      self.parser.EndElementHandler    = self.end_element
      self.parser.CharacterDataHandler = self.char_data

      self.target_tag_met = False
      self.index = None

   def start_element(self, name, attrs):
      self.target_tag_met = (name == self.TARGET_TAG)

   def end_element(self, name) :
      self.target_tag_met = False

   def char_data(self, data):
      # expat may deliver one text node in several chunks, so record only
      # the offset of the first chunk and then clear the flag.
      if self.target_tag_met :
         self.index = self.parser.CurrentByteIndex
         self.target_tag_met = False

# Open the file in binary mode for more robust byte offsets.
with open("so_test.xml", 'rb') as xmlFile:
   p = xml.parsers.expat.ParserCreate()
   h = handler(p)
   p.ParseFile(xmlFile)

# Offset of the text of the last TARGET_TAG element that has text.
print (h.index)
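The example above keeps a single offset. As a rough sketch of how the same idea might be extended to report a start and end offset for every w:t element, the variant below also reads CurrentByteIndex in the end-element handler, where it should point at the '<' of the closing tag. The function name `text_spans` and the inline `sample` bytes are hypothetical, chosen only for illustration; the sketch assumes the target element contains text only (no child elements).

```python
import xml.parsers.expat

def text_spans(xml_bytes, target_tag="w:t"):
    """Return (start, end) byte offsets of the text of each target_tag element."""
    parser = xml.parsers.expat.ParserCreate()
    spans = []
    state = {"inside": False, "start": None}

    def start_element(name, attrs):
        state["inside"] = (name == target_tag)
        state["start"] = None

    def char_data(data):
        # Text may arrive in several chunks; only the first one marks the start.
        if state["inside"] and state["start"] is None:
            state["start"] = parser.CurrentByteIndex

    def end_element(name):
        # Here CurrentByteIndex is the offset of the closing tag's '<'.
        if state["inside"] and name == target_tag and state["start"] is not None:
            spans.append((state["start"], parser.CurrentByteIndex))
        state["inside"] = False

    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element
    parser.Parse(xml_bytes, True)
    return spans

# Hypothetical sample modelled on the question's snippet.
sample = (b'<w:r><w:rPr><w:sz w:val="36"/></w:rPr>'
          b'<w:t>BIG_TEXT</w:t><w:t>w</w:t></w:r>')

for start, end in text_spans(sample):
    print(start, end - start, sample[start:end])
```

Slicing the source bytes with each span gives back the raw text as written in the file, so the length is measured in source bytes and stays correct even when the text contains entities such as `&amp;`.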
