How to iterate over children of child elements in XML with Python?
Problem description
I have an XML file structured like this:
<pages>
    <page>
        <textbox>
            <new_line>
                <text>
                </text>
            </new_line>
        </textbox>
    </page>
</pages>
I'm iterating over the text elements that are children of a new_line element, in order to join adjacent tags that have the same size attribute. However, I want to require that the new_line element is inside a textbox element. I tried adding a for loop to my code, but it simply doesn't work. Here is the code:
import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('output22.xml', parser)
root = tree.getroot()

# Iterate over every new_line block
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements in the new_line block
    list_text_elts = new_line_block.findall('text')
    # Iterate over them pairwise (previous element, current element)
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get the size attributes
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If they are equal and not both missing
        if curr_size == prev_size and curr_size is not None:
            # Get the current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Prepend the previous text to the current element
            current_text.text = pt + ct
            # Remove the previous element
            previous_text.getparent().remove(previous_text)

newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("utf-8")
print(newtree)
with open("output2.xml", "wb") as f:
    f.write(newtree)
Sample input:
"""<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
            </new_line>
        </textbox>
    </page>
</pages>
"""
Expected output:
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">API</text>
                <text size="12.482">TOLO</text>
                <text/>
                <text size="12.482">III</text>
                <text/>
            </new_line>
        </textbox>
    </page>
</pages>
Recommended answer
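You can define a recursive function to handle the multi-level XML in your case. It walks the original tree and builds a new one, passing the parent's tag down at each level so that new_line elements are merged only when they sit directly inside a textbox. I wrote a short script for this problem.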
import xml.etree.ElementTree as etree


def add_sub_element(parent, tag, attrib, text=None):
    # Create a child of `parent`; set its text only if there is any
    new_feed = etree.SubElement(parent, tag, attrib)
    if text:
        new_feed.text = text
    return new_feed


def my_tree_mapper(parent_tag, current, element):
    # Copy the children of `element` into `current`, recursing level by level
    if current.tag == 'new_line' and parent_tag == 'textbox':
        current_size = -1  # -1 means "no run of same-size text is pending"
        current_text = ""
        for child in element:
            child_tag = child.tag
            child_attrib = child.attrib
            child_text = child.text or ""  # avoid concatenating None
            if child_tag == 'text' and 'size' in child_attrib:
                if child_attrib['size'] == current_size:
                    # For 'text' children with the same size,
                    # append text until we get a different size
                    current_text = current_text + child_text
                else:
                    if current_size != -1:
                        # Add the pending run to the tree when the size changes
                        add_sub_element(
                            current, 'text', {'size': current_size}, current_text)
                    current_size = child_attrib['size']
                    current_text = child_text
            else:
                if current_size != -1:
                    # Add the pending run before a child that cannot be merged
                    add_sub_element(
                        current, 'text', {'size': current_size}, current_text)
                # No merge logic for other tags or for text without a size
                sub_element = add_sub_element(
                    current, child_tag, child_attrib, child_text)
                my_tree_mapper(current.tag, sub_element, child)
                current_size = -1
                current_text = ""
        if current_size != -1:
            # Flush a run that reaches the end of the new_line element
            add_sub_element(
                current, 'text', {'size': current_size}, current_text)
    else:
        # Outside textbox/new_line: copy each child unchanged and recurse
        for child in element:
            sub_element = add_sub_element(
                current, child.tag, child.attrib, child.text)
            my_tree_mapper(current.tag, sub_element, child)


the_input = """<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
            </new_line>
        </textbox>
    </page>
</pages>
"""

tree = etree.ElementTree(etree.fromstring(the_input))
root = tree.getroot()
new_root = etree.Element(root.tag, root.attrib)
my_tree_mapper('', new_root, root)
print(etree.tostring(new_root, encoding='unicode'))
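If rebuilding the whole tree is more than you need, the "new_line must be inside textbox" condition can also be put into the path expression itself. The following is only a sketch under a few assumptions: it uses the standard-library xml.etree.ElementTree rather than lxml, the helper name merge_text_runs is made up for illustration, and it modifies the tree in place instead of copying it. With lxml, the corresponding XPath would be '//textbox/new_line'.

```python
import xml.etree.ElementTree as ET


def merge_text_runs(root):
    """Merge adjacent same-size <text> siblings, only inside textbox/new_line.

    Hypothetical helper for illustration; it mutates `root` and returns it.
    """
    # './/textbox/new_line' matches new_line elements whose parent is a textbox
    for new_line in root.findall('.//textbox/new_line'):
        previous = None  # last kept <text> element, if any
        for current in list(new_line):  # iterate over a copy while removing
            size = current.get('size')
            if (previous is not None and current.tag == 'text'
                    and size is not None and previous.get('size') == size):
                # Same size as the kept element: append the text, drop the node
                previous.text = (previous.text or '') + (current.text or '')
                new_line.remove(current)
            else:
                # Start a new run (or reset it on a non-text child)
                previous = current if current.tag == 'text' else None
    return root


demo = ET.fromstring(
    '<pages><page><textbox><new_line>'
    '<text size="12.333">A</text><text size="12.333">P</text>'
    '<text size="12.333">I</text>'
    '</new_line></textbox></page></pages>'
)
print(ET.tostring(merge_text_runs(demo), encoding='unicode'))
```

Here the first element of each run is kept and the later ones are folded into it, so text elements without a size attribute are left untouched and act as separators between runs.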
I hope this helps, or at least gives you some ideas.
(You may also want to read more about recursive functions and about the xml.etree.ElementTree methods in their documentation and examples.)
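For readers new to recursion, the core pattern used in the answer can be reduced to a few lines: a function handles one element, then calls itself on each child while passing down the tag of the enclosing element. This is a minimal sketch; the function name walk and its return format are assumptions chosen for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET


def walk(element, parent_tag=''):
    """Return (parent_tag, tag) pairs for `element` and all its descendants."""
    pairs = [(parent_tag, element.tag)]
    for child in element:
        # Recurse, telling each child which tag encloses it
        pairs.extend(walk(child, element.tag))
    return pairs


doc = ET.fromstring(
    '<pages><page><textbox><new_line><text/></new_line></textbox></page></pages>'
)
for parent, tag in walk(doc):
    print(repr(parent), '->', tag)
```

Because every call knows its parent's tag, a check such as "this is a new_line whose parent is a textbox" becomes a simple comparison of the two values in a pair.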