How to take the preceding element when iterating over XML in Python?
Question
I have an XML file structured like this:
<?xml version="1.0" encoding="utf-8"?>
<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="179.739,592.028,261.007,604.510">
      <textline bbox="179.739,592.028,261.007,604.510">
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
        <text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
        <text font="NUMPTY+ImprintMTnum-it" bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
        <text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
        <text></text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
        <text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
        <text></text>
      </textline>
    </textbox>
  </page>
</pages>
The bbox attribute of each text tag has four values, and I need the difference between the first bbox value of an element and the first bbox value of the element preceding it. For example, the first bbox values 191.745 and 192.745 in the sample are a distance of 1 apart.
So far my code is:
def wrap(line, idxList):
    if len(idxList) == 0:
        return  # No elements to wrap
    # Take the first element from the original location
    idx = idxList.pop(0)  # Index of the first element
    elem = removeByIdx(line, idx)  # The indicated element
    # Create "newline" element with "elem" inside
    nElem = E.newline(elem)
    line.insert(idx, nElem)  # Put it in place of "elem"
    while len(idxList) > 0:  # Process the rest of index list
        # Value not used, but must be removed
        idxList.pop(0)
        # Remove the current element from the original location
        currElem = removeByIdx(line, idx + 1)
        nElem.append(currElem)  # Append it to "newline"

for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')
            distance = float(tbl[2]) - float(tbl[0])
        else:
            distance = 100  # "Too big" value
        if distance > 10:
            par = elem.getparent()
            idx = par.index(elem)
            idxList.append(idx)
        else:  # "Wrong" element, wrap elements "gathered" so far
            wrap(line, idxList)
            idxList = []
    # Process "good" elements without any "bad" after them, if any
    wrap(line, idxList)
But the part I am interested in is:
for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')
            distance = float(tbl[2]) - float(tbl[0])
I tried a lot and really don't know how to do it.
Recommended answer
If I understand your needs correctly, you want to select the text nodes that satisfy the following condition:
the bbox value of the text node minus the bbox value of the preceding text node is not greater than 10.
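As an illustration of this condition, the difference can also be computed directly in Python by remembering the previous value while iterating. This is only a minimal sketch under a few assumptions: the helper name first_bbox_deltas and the shortened sample document are made up for the example, empty text elements without a bbox are simply skipped, and the standard library's xml.etree.ElementTree is used (the same iter() and get() calls are also available on lxml elements).

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative version of the sample document.
SAMPLE = """<pages><page><textbox><textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text bbox="191.745,592.218,199.339,603.578">A</text>
<text bbox="192.745,592.218,199.339,603.578">P</text>
<text></text>
<text bbox="191.745,592.218,199.339,603.578">I</text>
</textline></textbox></page></pages>"""


def first_bbox_deltas(line):
    """Yield (element, delta) pairs for one textline.

    delta is the first bbox value of the element minus the first bbox
    value of the closest preceding <text> element that has a bbox.
    It is None for the first such element in the line.
    """
    previous = None
    for elem in line.iter("text"):
        bbox = elem.get("bbox")
        if bbox is None:
            continue  # assumption: skip empty <text></text> separators
        x0 = float(bbox.split(",")[0])
        yield elem, (None if previous is None else x0 - previous)
        previous = x0


root = ET.fromstring(SAMPLE)
results = []
for line in root.iter("textline"):
    for elem, delta in first_bbox_deltas(line):
        results.append((elem.text, delta))
        print(elem.text, delta)
```

Each delta can then be compared against the threshold of 10 in place of the `float(tbl[2]) - float(tbl[0])` expression, which measures the width of a single element and not the distance to its predecessor.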
You could try XSL and XPath. First the XSL code (a mandatory step so that the bbox values can be compared with XPath in the next step):
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="no" indent="yes"/>
  <xsl:template match="@bbox">
    <xsl:attribute name="{name()}">
      <xsl:value-of select="substring(.,1,3)" />
    </xsl:attribute>
  </xsl:template>
  <xsl:template match="@font">
    <xsl:attribute name="{name()}">
      <xsl:text>NUMPTY+ImprintMTnum</xsl:text>
    </xsl:attribute>
  </xsl:template>
  <xsl:template match="*[not(node())]"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
Then:
import lxml.etree as IP

xml = IP.parse(xml_filename)
xslt = IP.parse(xsl_filename)
transform = IP.XSLT(xslt)
Then query:
tree = transform(xml)  # apply the stylesheet to the parsed document
for node in tree.xpath("//text[@bbox<preceding::text[1]/@bbox+11]"):
    print(node)
Replace //text[@bbox<preceding::text[1]/@bbox+11]
with //text[@bbox>preceding::text[1]/@bbox]
to test with your sample data (this selects the text nodes whose bbox value is greater than the bbox value of the preceding text node).
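A variation worth considering is to do the comparison in XPath alone, without the XSL step, by extracting the first bbox value with substring-before(). The sketch below is an assumption-laden illustration, not the method described above: it requires lxml, compares only sibling text elements that carry a bbox attribute, and uses a shortened sample in which the last bbox (250.000) was invented to show an element being excluded.

```python
from lxml import etree

# Illustrative sample; the last bbox value is invented for the example.
SAMPLE = b"""<textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text bbox="191.745,592.218,199.339,603.578">A</text>
<text bbox="192.745,592.218,199.339,603.578">P</text>
<text></text>
<text bbox="250.000,592.218,257.594,603.578">I</text>
</textline>"""

# Select <text> elements whose first bbox value minus the first bbox value
# of the nearest preceding sibling <text> with a bbox is at most 10.
CLOSE_TO_PREVIOUS = (
    "//text[@bbox][preceding-sibling::text[@bbox]]"
    "[number(substring-before(@bbox, ','))"
    " - number(substring-before(preceding-sibling::text[@bbox][1]/@bbox, ','))"
    " <= 10]"
)

root = etree.fromstring(SAMPLE)
selected = [node.text for node in root.xpath(CLOSE_TO_PREVIOUS)]
print(selected)
```

Because preceding-sibling is a reverse axis, the [1] predicate picks the closest earlier sibling. The first element of a line is never selected, since it has no predecessor to compare against.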