How to iterate over children of child elements in XML with Python?


Question

I have an XML document structured like this:

<pages>
 <page>
  <textbox>
    <new_line>
     <text>
     </text>
    </new_line>
  </textbox>
 </page>
</pages>

I'm iterating over the text elements that are children of a new_line element in order to merge adjacent tags that have the same size attribute. However, I want to require that the new_line element is inside a textbox element. I tried adding a for loop to my code, but it doesn't work. Here is the code:

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('output22.xml', parser)
root = tree.getroot()

# Iterate over every new_line element in the document
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements that are direct children of this new_line
    list_text_elts = new_line_block.findall('text')

    # Iterate over all of them with the current and previous ones
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get size elements
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If the sizes are equal and not missing
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Prepend the previous text to the current element
            current_text.text = pt + ct
            # Remove the previous element
            previous_text.getparent().remove(previous_text)



newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("utf-8")
print(newtree)
with open("output2.xml", "wb") as f:
    f.write(newtree)

Sample input:

"""<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
          </new_line>
        </textbox>
    </page>
</pages>
"""

Expected output:

<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">API</text>
                <text size="12.482">TOLO</text>
                <text/>
                <text size="12.482">III</text>
                <text/>
            </new_line>
        </textbox>
    </page>
</pages>

Recommended answer

You can define a recursive function to handle the nested XML in your case. I wrote a short script for this problem.

import xml.etree.ElementTree as etree


def add_sub_element(parent, tag, attrib, text=None):
    """Append a child element to parent and set its text if there is any."""
    new_feed = etree.SubElement(parent, tag, attrib)

    if text:
        new_feed.text = text

    return new_feed


def my_tree_mapper(parent_tag, current, element):
    """Copy the children of element under current, merging runs of adjacent
    sized <text> siblings when current is a new_line inside a textbox."""
    if current.tag == 'new_line' and parent_tag == 'textbox':
        current_size = None  # size of the run being accumulated, if any
        current_text = ""

        for child in element:
            if child.tag == 'text' and 'size' in child.attrib:
                if child.attrib['size'] == current_size:
                    # Same size as the open run: append the text
                    current_text += child.text or ""
                else:
                    if current_size is not None:
                        # Size changed: write out the finished run
                        add_sub_element(
                            current, 'text', {'size': current_size}, current_text)

                    current_size = child.attrib['size']
                    current_text = child.text or ""
            else:
                if current_size is not None:
                    # A different tag, or a <text> without size, ends the run
                    add_sub_element(
                        current, 'text', {'size': current_size}, current_text)

                # Copy this child unchanged and recurse into it
                sub_element = add_sub_element(
                    current, child.tag, child.attrib, child.text)
                my_tree_mapper(current.tag, sub_element, child)

                current_size = None
                current_text = ""

        if current_size is not None:
            # Write out a run that reaches the last child
            add_sub_element(
                current, 'text', {'size': current_size}, current_text)
    else:
        # Anywhere else: copy each child unchanged and recurse into it
        for child in element:
            sub_element = add_sub_element(
                current, child.tag, child.attrib, child.text)
            my_tree_mapper(current.tag, sub_element, child)


the_input = """<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
          </new_line>
        </textbox>
    </page>
</pages>
"""

tree = etree.ElementTree(etree.fromstring(the_input))
root = tree.getroot()
new_root = etree.Element(root.tag, root.attrib)

my_tree_mapper('', new_root, root)
# encoding='unicode' returns a str instead of bytes, so print shows plain XML
print(etree.tostring(new_root, encoding='unicode'))

I hope this helps, or at least gives you some ideas.
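The restriction asked about in the question ("new_line must be inside textbox") can also be expressed as a path instead of a recursive copy. The following is only a sketch under some assumptions: it uses the standard library's xml.etree.ElementTree, where `.//textbox/new_line` selects new_line elements that are direct children of a textbox (with lxml, the XPath `//textbox/new_line` plays the same role). The function name `merge_same_size` and the shortened sample document are made up for illustration. Like the code in the question, it only looks at the text children, so two text elements separated by some other tag would still be treated as adjacent.

```python
import itertools
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample: one new_line inside a textbox, one outside
SAMPLE = """<pages>
  <page>
    <textbox>
      <new_line><text size="1">A</text><text size="1">B</text><text size="2">C</text><text/><text size="2">D</text></new_line>
    </textbox>
    <new_line><text size="1">X</text><text size="1">Y</text></new_line>
  </page>
</pages>"""


def merge_same_size(root):
    """Merge adjacent <text> siblings with equal size, only in textbox/new_line."""
    for new_line in root.findall('.//textbox/new_line'):
        texts = new_line.findall('text')
        # groupby only groups consecutive items, which is what "adjacent" needs
        for size, run in itertools.groupby(texts, key=lambda t: t.get('size')):
            run = list(run)
            if size is None or len(run) < 2:
                continue  # nothing to merge for unsized or single elements
            # Keep the first element of the run and move all text into it
            run[0].text = "".join(t.text or "" for t in run)
            for extra in run[1:]:
                new_line.remove(extra)
    return root


root = merge_same_size(ET.fromstring(SAMPLE))
merged = [t.text for t in root.findall('.//textbox/new_line/text')]
untouched = [t.text for t in root.findall('./page/new_line/text')]
print(merged)     # text of the new_line inside the textbox
print(untouched)  # text of the new_line outside any textbox
```

This version edits the tree in place rather than building a copy, so whitespace and attributes elsewhere in the document are left as they were.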

(If you want to read more about recursive functions, see the document and example. More about the XML etree methods is available here.)
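To make the recursion idea concrete, here is a minimal sketch of the pattern the answer relies on: a function that visits an element, then calls itself on each child while passing down the parent's tag. The helper name `walk` and the tiny inline document are assumptions for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET


def walk(element, parent_tag=None, depth=0):
    """Yield (depth, parent_tag, tag) for element and all of its descendants."""
    yield depth, parent_tag, element.tag
    for child in element:
        # Recursive step: the current element becomes the parent of its children
        yield from walk(child, element.tag, depth + 1)


doc = ET.fromstring(
    "<pages><page><textbox><new_line><text/></new_line></textbox></page></pages>"
)
visited = list(walk(doc))
for depth, parent_tag, tag in visited:
    print("  " * depth + f"{tag} (parent: {parent_tag})")

# Select only new_line elements whose parent is a textbox
hits = [tag for _, parent_tag, tag in visited
        if tag == "new_line" and parent_tag == "textbox"]
```

Passing the parent tag as an argument is what lets each call decide whether it is looking at a new_line inside a textbox, since standard-library elements do not keep a reference to their parent.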
