Merging Lots of XML files


Question

I have a lot of XML files that I need to merge. I tried the code from the question "merging xml files using python's ElementTree", edited for my needs:

import os, os.path, sys
import glob
from xml.etree import ElementTree

def run(files):
    xml_files = glob.glob(files +"/*.xml")
    xml_element_tree = None
    for xml_file in xml_files:
        print xml_file
        data = ElementTree.parse(xml_file).getroot()
        # print ElementTree.tostring(data)
        for result in data.iter('TALLYMESSAGE'):
            if xml_element_tree is None:
                xml_element_tree = data 
                insertion_point = xml_element_tree.findall("./BODY/DATA/TALLYMESSAGE")[0]
            else:
                insertion_point.extend(result) 
    if xml_element_tree is not None:
        f =  open("myxmlfile.xml", "wb")
        f.write(ElementTree.tostring(xml_element_tree))
run("F:/data/data")

But the problem is that I have a lot of XML files, 365 to be precise, and each one is at least 2 MB. Merging them all crashed my PC. This is an image of the XML tree of my files:

[Image: XML element tree] https://i.stack.imgur.com/E8CFt.png

My updated code is:

import os, os.path, sys
import glob
from lxml import etree
def XSLFILE(files):
    xml_files = glob.glob(files +"/*.xml")
    #print xml_files[0]
    xslstring = """<?xml version="1.0" ?> 
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 
<xsl:template match="/DATA">
<DATA>
<xsl:copy>
<xsl:copy-of select="TALLYMESSAGE"/>\n"""
    #print xslstring
    for xmlfile in xml_files[1:]:
        xslstring = xslstring + '<xsl:copy-of select="document(\'' + xmlfile[-16:] + "')/BODY/DATA/TALLYMESSAGE\"/>\n"
    xslstring = xslstring + """</xsl:copy>+
</DATA>
</xsl:template> 
</xsl:transform>"""
    #print xslstring
    with open("parsingxsl.xsl", "w") as f:
        f.write(xslstring)
    with open(xml_files[0], "r") as f:
        dom = etree.XML(f.read())
    print etree.tostring(dom)
    with open('F:\data\parsingxsl.xsl', "r") as f:
        xslt_tree = etree.XML(f.read())
    print xslt_tree
    transform = etree.XSLT(xslt_tree)
    newdom = transform(dom)
    #print newdom
    tree_out = etree.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)
    print(tree_out)

    xmlfile = open('F:\data\OutputFile.xml','wb')
    xmlfile.write(tree_out)
    xmlfile.close()
XSLFILE("F:\data\data")

Running it produces the following error:

Traceback (most recent call last):
  File "F:\data\xmlmergexsl.py", line 38, in <module>
    XSLFILE("F:\data\data")
  File "F:\data\xmlmergexsl.py", line 36, in XSLFILE
    xmlfile.write(tree_out)
TypeError: must be string or buffer, not None


Recommended answer

Consider using XSLT and its document() function to merge the XML files. Python, like many general-purpose languages, has XSLT processors available, such as the one in the lxml module. For background, XSLT is a declarative language for transforming XML files into various formats and structures.

For your purposes, XSLT may be more efficient than building the merged file in application code, because no lists, loops, or other objects are held in memory during processing beyond what the XSLT processor itself uses.

XSLT (to be saved externally as an .xsl file)

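Consider first running a Python loop that writes out the text of the copy-of lines for all 365 documents, to avoid copying and pasting. Also note that the first document is skipped because it is the starting point used in the Python script below: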

<?xml version="1.0" ?> 
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 

 <xsl:template match="DATA">
  <DATA>
    <xsl:copy> 
       <xsl:copy-of select="TALLYMESSAGE"/>
       <xsl:copy-of select="document('Document2.xml')/BODY/DATA/TALLYMESSAGE"/>
       <xsl:copy-of select="document('Document3.xml')/BODY/DATA/TALLYMESSAGE"/>
       <xsl:copy-of select="document('Document4.xml')/BODY/DATA/TALLYMESSAGE"/>
       ...
       <xsl:copy-of select="document('Document365.xml')/BODY/DATA/TALLYMESSAGE"/>             
    </xsl:copy>
  </DATA>
 </xsl:template> 

</xsl:transform>
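The text-writing loop suggested above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper names build_merge_xslt and write_merge_xslt are hypothetical, and the sketch assumes the stylesheet is saved in the same folder as the XML files (a relative name passed to document() as a string is resolved against the stylesheet's location) and that no file name contains an apostrophe. It reproduces the template shown above unchanged and uses only the standard library.

```python
import glob
import os


def build_merge_xslt(xml_files):
    """Return stylesheet text that pulls TALLYMESSAGE nodes from xml_files.

    The first file is skipped on purpose: it is the document the
    transform is run against, so it is covered by the plain
    select="TALLYMESSAGE" line.
    """
    lines = [
        '<?xml version="1.0" ?>',
        '<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">',
        '',
        ' <xsl:template match="DATA">',
        '  <DATA>',
        '    <xsl:copy>',
        '       <xsl:copy-of select="TALLYMESSAGE"/>',
    ]
    for name in xml_files[1:]:
        # One copy-of line per remaining document.
        lines.append(
            '       <xsl:copy-of select="document(\'%s\')/BODY/DATA/TALLYMESSAGE"/>' % name
        )
    lines += [
        '    </xsl:copy>',
        '  </DATA>',
        ' </xsl:template>',
        '',
        '</xsl:transform>',
    ]
    return '\n'.join(lines) + '\n'


def write_merge_xslt(folder, xsl_path):
    """Write a stylesheet covering every .xml file in folder.

    Returns the name of the first file, which is the one to pass
    to the transform as the starting document.
    """
    names = sorted(os.path.basename(p) for p in glob.glob(os.path.join(folder, '*.xml')))
    if not names:
        raise ValueError('no .xml files found in %r' % folder)
    with open(xsl_path, 'w') as f:
        f.write(build_merge_xslt(names))
    return names[0]
```

Calling write_merge_xslt on the data folder would produce the .xsl file and return the document to start from. The names are sorted alphabetically, so Document10.xml comes before Document2.xml; sort differently if the order of the merged messages matters.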

Python (to be included in your overall script)

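import lxml.etree as ET

# Raw strings keep the backslashes in Windows paths literal
# (otherwise '\f' in '\file.xsl' is read as a form feed).
dom = ET.parse(r'C:\Path\To\XML\Document1.xml')
xslt = ET.parse(r'C:\Path\To\XSL\file.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)

# Serialize the result tree to UTF-8 bytes
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out)

# Write the bytes in binary mode; the with block closes the file
with open(r'C:\Path\To\XML\OutputFile.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)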
