Python lxml - How to remove empty repeated tags

Question

I have some XML, generated by a script, that may or may not contain empty elements. I have been told that the XML must no longer contain empty elements. Here is an example:

<customer>  
    <govId>
       <id>@</id>
       <idType>SSN</idType>
           <issueDate/>
           <expireDate/>
           <dob/>
           <state/>
           <county/>
           <country/>
    </govId>
    <govId>
        <id/>
        <idType/>
        <issueDate/>
        <expireDate/>
        <dob/>
        <state/>
        <county/>
        <country/>
    </govId>
</customer>

The output should look like this:

<customer>  
    <govId>
       <id>@</id>
       <idType>SSN</idType>        
    </govId>        
</customer>

I need to remove all the empty elements. Note that my code removed the empty children of the first "govId" element but left the second "govId" untouched. I am using lxml.objectify at the moment.

Here is basically what I am doing:

root = objectify.fromstring(xml)
for customer in root.customers.iterchildren():
    for e in customer.govId.iterchildren():
        if not e.text:
            customer.govId.remove(e)

Does anyone know of a way to do this with lxml objectify, or is there an easier way altogether? I would also like to remove the second "govId" element in its entirety if all its elements are empty.

Recommended answer

First of all, the problem with your code is that you iterate over the customers but not over the govIds. On the third line you take the first govId of each customer and iterate over its children. You would need another for loop for the code to work as you intended.
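The extra loop described above might look roughly like the sketch below. It rests on a few assumptions that are not in the question: it uses plain lxml.etree rather than objectify, it treats `customer` as the root element (as in the sample XML), and the shortened `XML` string is made up for illustration.

```python
from lxml import etree

# Shortened, whitespace-free sample (an assumption for this sketch).
XML = (
    "<customer>"
    "<govId><id>@</id><idType>SSN</idType><issueDate/><dob/></govId>"
    "<govId><id/><idType/><issueDate/><dob/></govId>"
    "</customer>"
)

root = etree.fromstring(XML)

# Outer loop: visit every govId, not only the first one.
for gov_id in root.findall("govId"):
    # Copy the children into a list so that removing elements
    # does not disturb the iteration.
    for child in list(gov_id):
        if not child.text:
            gov_id.remove(child)

print(etree.tostring(root, encoding="unicode"))
```

This clears the children of both govId elements, but the second govId itself is left behind as an empty element.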

The sentence at the end of your question then makes the problem quite a bit more complex: I would also like to remove the second "govId" element in its entirety if all its elements are empty.

This means that, unless you want to hard-code a check for just one level of nesting, you need to check recursively whether an element and its children are empty. For example:

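def recursively_empty(e):
    # An element with text of its own is not empty.
    if e.text:
        return False
    # Otherwise it is empty only if every child is empty too
    # (all() over no children is True).
    return all(recursively_empty(c) for c in e.iterchildren())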

Note: this requires Python 2.5+ because it uses the all() builtin.

You can then change your code to something like this to remove every element in the document that is empty all the way down.

from lxml import etree

# Walk over all elements in the tree and remove all
# nodes that are recursively empty
context = etree.iterwalk(root)
for action, elem in context:
    parent = elem.getparent()
    # The root has no parent and cannot be removed this way
    if parent is not None and recursively_empty(elem):
        parent.remove(elem)

Example output:

<customer>
  <govId>
    <id>@</id>
    <idType>SSN</idType>
  </govId>
</customer>
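For reference, the pieces above can be combined into one self-contained script along these lines. This is a sketch under stated assumptions: it parses with plain lxml.etree and an explicit `remove_blank_text=True` parser (so that indentation is not counted as element text), and the sample XML is a shortened version of the one in the question.

```python
from lxml import etree

XML = """<customer>
    <govId>
        <id>@</id>
        <idType>SSN</idType>
        <issueDate/>
        <state/>
    </govId>
    <govId>
        <id/>
        <idType/>
        <issueDate/>
        <state/>
    </govId>
</customer>"""


def recursively_empty(e):
    if e.text:
        return False
    return all(recursively_empty(c) for c in e.iterchildren())


# Assumption: drop whitespace-only text nodes at parse time so that
# indentation does not make an element look non-empty.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(XML, parser)

# iterwalk yields "end" events by default, so children are visited
# (and possibly removed) before their parent is checked.
for _, elem in etree.iterwalk(root):
    parent = elem.getparent()
    if parent is not None and recursively_empty(elem):
        parent.remove(elem)

print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```

Because each parent is checked only after its children have been processed, the second govId is removed as a whole once all of its children are gone.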

One thing you might want to do is refine the condition if e.text: in the recursive function. Currently it treats None and the empty string as empty, but not whitespace such as spaces and newlines. Use str.strip() if whitespace-only text is part of your definition of "empty".
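A whitespace-aware variant could be sketched as follows. The helper name `is_blank` is hypothetical and introduced only for this example, and the small test document is made up.

```python
from lxml import etree


def is_blank(text):
    # None, "" and whitespace-only strings all count as empty.
    return text is None or not text.strip()


def recursively_empty(e):
    if not is_blank(e.text):
        return False
    return all(recursively_empty(c) for c in e.iterchildren())


# Parsed with the default parser, so the indentation is kept as text.
root = etree.fromstring("<a>\n  <b>\n    <c/>\n  </b>\n  <d>x</d>\n</a>")

print(recursively_empty(root.find("b")))  # True
print(recursively_empty(root))            # False
```

With this version the check no longer depends on how the document was indented or on which parser options were used.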

Edit: As pointed out by @Dave, the recursive function could be improved by using a generator expression:

return all((recursively_empty(c) for c in e.getchildren()))

This does not evaluate recursively_empty(c) for all the children at once; it evaluates each one lazily. Since all() stops iterating at the first False element, this could mean a significant performance improvement.
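The difference can be seen in a small, library-free illustration. The `check` function and the `calls` list are stand-ins invented for this sketch; `check` plays the role of recursively_empty(c) and records each call.

```python
calls = []


def check(n):
    # Stand-in for recursively_empty(c); records every invocation.
    calls.append(n)
    return n != 0


values = [0, 1, 2, 3]

# List comprehension: every item is checked before all() runs.
calls.clear()
all([check(n) for n in values])
eager_calls = list(calls)

# Generator expression: all() stops at the first False result.
calls.clear()
all(check(n) for n in values)
lazy_calls = list(calls)

print(eager_calls)  # [0, 1, 2, 3]
print(lazy_calls)   # [0]
```

The saving grows with the size of the subtree, since each skipped call would otherwise recurse into a child's descendants.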

Edit 2: The expression can be further optimized by using e.iterchildren() instead of e.getchildren(). This works with both the lxml etree API and the objectify API.
