How to delete duplicated elements in an XML file

Problem description

Here is my XML file. It contains a duplicated element, <houseNum>0</houseNum>:

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfHouse>
<XmlForm>
<houseNum>0</houseNum>
 <plan1> 
  <coord>
    <X> 1.2  </X>
    <Y> 2.1  </Y>
    <Z> 3.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
  </color>
 </plan1>
 <plan2>
  <coord>  
    <X> 21.2  </X>
    <Y> 22.1  </Y>
    <Z> 31.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
</color>
 </plan2> 
</XmlForm>
<XmlForm>
<houseNum>0</houseNum>
 <plan1> 
  <coord>
    <X> 1.2  </X>
    <Y> 2.1  </Y>
    <Z> 3.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
  </color>
 </plan1>
 <plan2>
  <coord>  
    <X> 21.2  </X>
    <Y> 22.1  </Y>
    <Z> 31.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 0   </B>
</color>
 </plan2> 
</XmlForm>

<XmlForm>
<houseNum>1</houseNum>
 <plan1> 
  <coord>
    <X> 11.2  </X>
    <Y> 12.1  </Y>
    <Z> 13.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 255   </G>
    <B> 0   </B>
  </color>
 </plan1>
 <plan2>
  <coord>  
    <X> 211.2  </X>
    <Y> 212.1  </Y>
    <Z> 311.0  </Z>
  </coord>
  <color> 
    <R> 255 </R>
    <G> 0   </G>
    <B> 255   </B>
</color>
 </plan2> 
</XmlForm>
</ArrayOfHouse>

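In my case, there are two types of duplication:

1) The duplicated elements are consecutive. This is the code I use to remove them: I compare element[i] with element[i + 1], and if element[i].text == element[i + 1].text, I delete element[i + 1].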

from lxml import etree

def Remove_Duplication_XML(xml_file):
    tree = etree.parse(xml_file)
    root = tree.getroot()

    # Collect every <houseNum> element in document order
    elementlist = list(root.iter('houseNum'))
    numframes = [x.text for x in elementlist]
    print(numframes)

    # Compare each <houseNum> with the one before it and remove it on a match.
    # Note: this removes only the <houseNum> node, not its parent <XmlForm>.
    for index_element in range(1, len(elementlist)):
        current = elementlist[index_element]
        if current.text == elementlist[index_element - 1].text:
            current.getparent().remove(current)
            print(current.text)

    # Serialize the XML without the consecutive duplicates
    xml_string = etree.tostring(root).decode("utf-8")
    print(xml_string)

2) The duplicated elements are not consecutive. I am still looking for a way to handle this case. Any help?

Recommended answer

Consider XSLT, the special-purpose language designed to transform XML files (analogous to SQL, a special-purpose language for querying databases). Because you already use Python's lxml, you can run such a script directly and remove duplicates anywhere in the document without writing a single for loop or if statement.

Specifically, use Muenchian grouping, an XSLT 1.0 technique: index the XML document by houseNum with <xsl:key>, then keep only the first <XmlForm> in each group. As an added bonus, the XSLT below also strips surrounding whitespace from text nodes and pretty-prints the result with indentation:

XSLT (save as an .xsl file, which is itself a special kind of .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="id" match="XmlForm" use="houseNum" />

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="XmlForm[generate-id() != generate-id(key('id', houseNum))]"/>

  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>

</xsl:stylesheet>
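
To make the grouping logic easier to follow, here is a rough Python analogue of what the key and the generate-id() comparison do: remember the first <XmlForm> seen for each houseNum value and drop every other one. This is only a sketch, not part of the answer's method. It uses the standard library's xml.etree.ElementTree instead of lxml, the function name drop_duplicate_forms is made up for illustration, and the sample is a reduced, hypothetical version of the question's file with placeholder <plan1> text.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical sample: the two houseNum 0 forms are NOT consecutive
SAMPLE = """<ArrayOfHouse>
  <XmlForm><houseNum>0</houseNum><plan1>a</plan1></XmlForm>
  <XmlForm><houseNum>1</houseNum><plan1>b</plan1></XmlForm>
  <XmlForm><houseNum>0</houseNum><plan1>c</plan1></XmlForm>
</ArrayOfHouse>"""


def drop_duplicate_forms(root):
    """Keep the first <XmlForm> per houseNum value and remove the rest."""
    first_by_key = {}  # plays the role of <xsl:key name="id" .../>
    for form in root.findall("XmlForm"):
        first_by_key.setdefault(form.findtext("houseNum"), form)

    # Iterate over a copy because children are removed while looping
    for form in list(root.findall("XmlForm")):
        # Same idea as generate-id() != generate-id(key('id', houseNum))
        if first_by_key[form.findtext("houseNum")] is not form:
            root.remove(form)
    return root


root = drop_duplicate_forms(ET.fromstring(SAMPLE))
kept = [(f.findtext("houseNum"), f.findtext("plan1")) for f in root.findall("XmlForm")]
print(kept)
```

Because the lookup is by value rather than by position, it does not matter whether the duplicates are adjacent.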

Python

import lxml.etree as et

# LOAD XML AND XSL FILES
xml = et.parse('Source.xml')
xsl = et.parse('XSLTScript.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(xml)

# PRINT RESULT TO SCREEN
print(result)

# SAVE RESULT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
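
For a quick experiment without creating Source.xml and XSLTScript.xsl on disk, the same transform can be run on in-memory strings. The sketch below assumes lxml is installed. The XML is a shortened, hypothetical variant of the question's file in which the duplicated houseNum 0 form is deliberately non-consecutive, and the stylesheet is the one above, embedded as a string.

```python
import lxml.etree as et

# Shortened, hypothetical sample with a non-consecutive duplicate (houseNum 0)
XML = """<ArrayOfHouse>
  <XmlForm><houseNum>0</houseNum><plan1><coord><X> 1.2  </X></coord></plan1></XmlForm>
  <XmlForm><houseNum>1</houseNum><plan1><coord><X> 11.2  </X></coord></plan1></XmlForm>
  <XmlForm><houseNum>0</houseNum><plan1><coord><X> 1.2  </X></coord></plan1></XmlForm>
</ArrayOfHouse>"""

# The stylesheet from the answer, embedded as a string
XSL = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="id" match="XmlForm" use="houseNum"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="XmlForm[generate-id() != generate-id(key('id', houseNum))]"/>
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>"""

# Build the transformer from the parsed stylesheet and apply it to the parsed XML
transform = et.XSLT(et.fromstring(XSL))
result = transform(et.fromstring(XML))

# Inspect the values that survive the transform
house_nums = [e.text for e in result.getroot().iter("houseNum")]
x_values = [e.text for e in result.getroot().iter("X")]
print(house_nums)
print(x_values)
```

Printing str(result) instead shows the full, indented document.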

Output (note that surrounding whitespace has been trimmed from the text values)

<?xml version="1.0"?>
<ArrayOfHouse>
  <XmlForm>
    <houseNum>0</houseNum>
    <plan1>
      <coord>
        <X>1.2</X>
        <Y>2.1</Y>
        <Z>3.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>0</B>
      </color>
    </plan1>
    <plan2>
      <coord>
        <X>21.2</X>
        <Y>22.1</Y>
        <Z>31.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>0</B>
      </color>
    </plan2>
  </XmlForm>
  <XmlForm>
    <houseNum>1</houseNum>
    <plan1>
      <coord>
        <X>11.2</X>
        <Y>12.1</Y>
        <Z>13.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>255</G>
        <B>0</B>
      </color>
    </plan1>
    <plan2>
      <coord>
        <X>211.2</X>
        <Y>212.1</Y>
        <Z>311.0</Z>
      </coord>
      <color>
        <R>255</R>
        <G>0</G>
        <B>255</B>
      </color>
    </plan2>
  </XmlForm>
</ArrayOfHouse>
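
One way to confirm that a result like the one above no longer contains duplicates is to count the houseNum values. The following is a small sketch using only the standard library. The helper name duplicated_house_nums and the two inline sample strings are hypothetical and stand in for the real input and output files.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def duplicated_house_nums(xml_text):
    """Return the houseNum values that occur in more than one <XmlForm>."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        (form.findtext("houseNum") or "").strip()
        for form in root.findall("XmlForm")
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical stand-ins for the source file and the transformed file
before = (
    "<ArrayOfHouse>"
    "<XmlForm><houseNum>0</houseNum></XmlForm>"
    "<XmlForm><houseNum>0</houseNum></XmlForm>"
    "<XmlForm><houseNum>1</houseNum></XmlForm>"
    "</ArrayOfHouse>"
)
after = (
    "<ArrayOfHouse>"
    "<XmlForm><houseNum>0</houseNum></XmlForm>"
    "<XmlForm><houseNum>1</houseNum></XmlForm>"
    "</ArrayOfHouse>"
)

print(duplicated_house_nums(before))
print(duplicated_house_nums(after))
```

An empty list means every houseNum appears in exactly one <XmlForm>.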
