Prune some elements from a large XML file


Problem description

I have an XML file of more than 1 GB and I want to reduce its size by removing unwanted children of a parent tag, either by creating a new XML file or by rewriting the existing one. How can this be done in Python? Because the file is so large, a simple tree = ElementTree.parse(xmlfile) won't work.

XML file

For every parent tag TasksReportNode in the file, I want to keep only the child TableRow whose RowCount attribute has the value 0 and discard all other TableRow children of that parent.

Sample XML code:

<TasksReportNode Name="Task15">
    <TableData NumRows="97" NumColumns="15">
        <TableRow RowCount="0">
            <TableColumn Name="Task"><![CDATA[   Task15 [GET - /PULSEV31/appView/projectFeedHidden.jsp - 200]]]></TableColumn>
            <TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
            <TableColumn Name="Successful"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Total"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Min(ms)"><![CDATA[15]]></TableColumn>
            <TableColumn Name="Avg(ms)"><![CDATA[24.20]]></TableColumn>
            <TableColumn Name="Avg-90%(ms)"><![CDATA[54.55]]></TableColumn>
            <TableColumn Name="90%ile(ms)"><![CDATA[89.98]]></TableColumn>
            <TableColumn Name="95%ile(ms)"><![CDATA[95.24]]></TableColumn>
            <TableColumn Name="99%ile(ms)"><![CDATA[99.45]]></TableColumn>
            <TableColumn Name="Max(ms)"><![CDATA[94]]></TableColumn>
            <TableColumn Name="Std. Dev."><![CDATA[15.74]]></TableColumn>
            <TableColumn Name="Bytes Recd(KB)"><![CDATA[192]]></TableColumn>
        </TableRow>
        <TableRow RowCount="1">
            <TableColumn Name="Task"><![CDATA[      VirtualUser1]]></TableColumn>
            <TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
            <TableColumn Name="Successful"><![CDATA[1]]></TableColumn>
            <TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Total"><![CDATA[1]]></TableColumn>
            <TableColumn Name="Min(ms)"><![CDATA[934]]></TableColumn>
            <TableColumn Name="Avg(ms)"><![CDATA[934.00]]></TableColumn>
            <TableColumn Name="Avg-90%(ms)"><![CDATA[950.00]]></TableColumn>
            <TableColumn Name="90%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
            <TableColumn Name="95%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
            <TableColumn Name="99%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
            <TableColumn Name="Max(ms)"><![CDATA[934]]></TableColumn>
            <TableColumn Name="Std. Dev."><![CDATA[0.00]]></TableColumn>
            <TableColumn Name="Bytes Recd(KB)"><![CDATA[0]]></TableColumn>
        </TableRow>
    </TableData>
    <TableData NumRows="1" NumColumns="2">
        <TableRow RowCount="0">
            <TableColumn Name="Response Time Interval (ms)"><![CDATA[0 - 99]]></TableColumn>
            <TableColumn Name="Frequency"><![CDATA[96]]></TableColumn>
        </TableRow>
    </TableData>
</TasksReportNode>
<TasksReportNode Name="Task16">
    <TableData NumRows="97" NumColumns="15">
        <TableRow RowCount="0">
            <TableColumn Name="Task"><![CDATA[   Task16 [GET - /PULSEV31/appView/projectCommentHidden.jsp - 200]]]></TableColumn>
            <TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
            <TableColumn Name="Successful"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Total"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Min(ms)"><![CDATA[15]]></TableColumn>
            <TableColumn Name="Avg(ms)"><![CDATA[22.73]]></TableColumn>
            <TableColumn Name="Avg-90%(ms)"><![CDATA[54.55]]></TableColumn>
            <TableColumn Name="90%ile(ms)"><![CDATA[90.93]]></TableColumn>
            <TableColumn Name="95%ile(ms)"><![CDATA[96.25]]></TableColumn>
            <TableColumn Name="99%ile(ms)"><![CDATA[100.50]]></TableColumn>
            <TableColumn Name="Max(ms)"><![CDATA[109]]></TableColumn>
            <TableColumn Name="Std. Dev."><![CDATA[14.76]]></TableColumn>
            <TableColumn Name="Bytes Recd(KB)"><![CDATA[192]]></TableColumn>
        </TableRow>
    </TableData>
</TasksReportNode>

This is what I have tried:

import xml.etree.ElementTree as etree  # assumed; the original snippet does not show its import

xmL = 'F:\\Reports\\Logs\\Result_TG1_V16.xml'

context = etree.iterparse(xmL, events=("start", "end"))
for event, element in context:
    if element.tag == 'TasksReportNode':
        for child1 in element:  # TableData
            for child2 in child1:  # TableRow
                if child2.get("RowCount") == "0":
                    for child3 in child2:  # TableColumn
                        print(child3.tag, child3.text)
    element.clear()  # discard the element
del context

Now we have all the rows whose RowCount is '0'; these could be added to the parent, leaving out all the other siblings.

Recommended answer

I would recommend using lxml, as it is in most regards more efficient than the stdlib xml.etree.ElementTree.

Do not attempt to parse the document as a whole, as it is too large; process the source document iteratively instead.

The relevant lxml documentation page is "Event-driven parsing".

There are two options:

  • etree.iterparse
  • using a custom parser that fires SAX-like events
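
As a rough illustration of the second option, a parser target receives callbacks instead of elements. This is a minimal sketch and not part of the original answer: the RowCounter class and the inline sample are hypothetical, and the import falls back to the stdlib xml.etree.ElementTree (whose XMLParser accepts a target in the same way) when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


class RowCounter:
    """Hypothetical parser target: counts TableRow elements, builds no tree."""

    def __init__(self):
        self.kept = 0
        self.dropped = 0

    def start(self, tag, attrib):
        # Called for every opening tag with its attributes.
        if tag == "TableRow":
            if attrib.get("RowCount") == "0":
                self.kept += 1
            else:
                self.dropped += 1

    def end(self, tag):
        pass  # closing tags are not needed for counting

    def data(self, data):
        pass  # text content is ignored

    def close(self):
        # The return value of close() becomes the result of parser.close().
        return self.kept, self.dropped


sample = (
    b'<TableData><TableRow RowCount="0"/>'
    b'<TableRow RowCount="1"/><TableRow RowCount="2"/></TableData>'
)
parser = etree.XMLParser(target=RowCounter())
parser.feed(sample)  # a large file could be fed chunk by chunk
kept, dropped = parser.close()
print(kept, dropped)
```

Because no tree is built, memory use stays flat, but all state (such as which row is currently open) has to be tracked by hand in the target.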

I personally prefer etree.iterparse, as it gives you parsed elements in a much more convenient way. But you must not forget to clean up the parts you have already processed, otherwise you will not save any memory compared to parsing the whole document at once.
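
The clean-up step can be sketched as follows. This is only an illustration under assumptions: count_rows and the inline sample (including its Report root) are hypothetical names, and the import falls back to the stdlib module when lxml is missing. Each TasksReportNode is inspected once it is complete and then cleared, so its content does not accumulate in memory.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def count_rows(source):
    """Return {task name: number of TableRow elements}, clearing as we go."""
    counts = {}
    for event, elem in etree.iterparse(source):  # "end" events only
        if elem.tag == "TasksReportNode":
            counts[elem.get("Name")] = sum(1 for _ in elem.iter("TableRow"))
            elem.clear()  # drop children, text and attributes of this node
    return counts


sample = io.BytesIO(
    b'<Report>'
    b'<TasksReportNode Name="Task15"><TableData>'
    b'<TableRow RowCount="0"/><TableRow RowCount="1"/>'
    b'</TableData></TasksReportNode>'
    b'<TasksReportNode Name="Task16"><TableData>'
    b'<TableRow RowCount="0"/>'
    b'</TableData></TasksReportNode>'
    b'</Report>'
)
counts = count_rows(sample)
print(counts)
```

Without the elem.clear() call the loop would still work, but the whole tree would end up in memory just as with a plain parse().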

Adding a real example

An example says more than tons of theory. Here is my attempt:

from lxml import etree

# fname = "large.xml"  # 78 MB
fname = "verylarge.xml"  # 773 MB

toremove = []  # unwanted TableRow elements of the TableData being parsed

# iterparse yields only "end" events by default
for event, element in etree.iterparse(fname):
    if element.tag == "TableRow":
        if element.attrib["RowCount"] != "0":
            element.clear()
            # removing the current element here caused a segmentation fault
            # element.getparent().remove(element)
            toremove.append(element)
    if element.tag == "TableData":
        # the parent is complete now, so the collected rows can be removed
        for rowelm in toremove:
            element.remove(rowelm)
        toremove = []

# the last processed element is the root one
# tostring() returns bytes, so the file is opened in binary mode
with open("out.xml", "wb") as f:
    f.write(etree.tostring(element))

To test the performance, I took your large sample file (73 MB), repeated its inner part 10 times to get a 773 MB XML file, and processed that.

The processing took 24 seconds (Zenbook, Core i7, 4 GB RAM) and the resulting file was 4.7 MB.

By default, iterparse provides only "end" events, fired when an element has been completely parsed.
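
A small check of this behaviour, as a sketch only: events_seen and the two-element sample are hypothetical, and the import falls back to the stdlib module, whose iterparse takes the same events argument.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def events_seen(xml_bytes, **kwargs):
    """List the (event, tag) pairs that iterparse reports for a document."""
    source = io.BytesIO(xml_bytes)
    return [(event, elem.tag) for event, elem in etree.iterparse(source, **kwargs)]


doc = b"<TableData><TableRow/></TableData>"
default_events = events_seen(doc)
both_events = events_seen(doc, events=("start", "end"))
print(default_events)
print(both_events)
```

With the default events, a child is always reported before its parent, which is why the root element is the last one the loop sees.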

This solution relies on the fact that, even with iterparse, the elements are kept in memory. This is used in the following places:

  • During iterparse, elements that are not needed are cleared (element.clear()) and removed (element.remove(rowelm)). clear() empties the inner content of the element, but the element itself still exists. remove() is called on the parent element and removes the child from it.
  • Elements that are to be kept are neither removed nor cleared, so at the end they are still present in the root element.
  • Finally, when everything has been processed, the last processed element is the root one. It is still in memory, so I can write it to a file as a string.
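
The difference between clear() and remove() can be seen on a tiny in-memory tree. This is an illustrative sketch with a made-up two-row sample; the import falls back to the stdlib module, where both methods are also available.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    '<TableData><TableRow RowCount="0">a</TableRow>'
    '<TableRow RowCount="1">b</TableRow></TableData>'
)
row = root[1]

# clear() empties the element, but it is still a child of its parent
row.clear()
after_clear = (len(root), dict(row.attrib), row.text)

# remove() is called on the parent and detaches the child
root.remove(row)
after_remove = [r.get("RowCount") for r in root]

print(after_clear)
print(after_remove)
```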

One has to be careful about when to remove() an element. Trying to remove an element from its parent while it was the element currently being iterated caused a segmentation fault. For this reason, the code postpones remove() of the "TableRow" elements until parsing of the parent TableData element is complete.

The variable toremove collects the unwanted "TableRow" elements and is consumed as soon as the parent "TableData" element is completely parsed. Note that remove() works only on the element's real parent, so it has to be called at the proper time.

For even larger files, this solution would be limited by the size of the resulting XML document, as it is kept in memory until pruning of the source XML is complete.

For such scenarios, we would have to write the output during parsing and get rid of all elements in memory that have already been processed. In practice, you would have to write out the "opening XML element" part (like "<TaskReportSummary att="a" otheratt="bb") when a "start" event appears, and write the closing XML element part "/>" at the "end" event.
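
A rough sketch of such a streaming variant follows. It is not the answer's code: prune_stream and the sample are hypothetical, it writes full opening and closing tags rather than "/>" so that elements with children work, and it assumes data-style XML without namespaces, comments or mixed content (text is written only for leaf elements, and whitespace between elements is dropped). The import falls back to the stdlib module when lxml is missing.

```python
import io
from xml.sax.saxutils import escape, quoteattr

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def prune_stream(source, out):
    """Copy source to out, dropping every TableRow whose RowCount is not "0"."""
    skip_depth = 0  # > 0 while inside a TableRow that is being dropped
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            if skip_depth or (
                elem.tag == "TableRow" and elem.get("RowCount") != "0"
            ):
                skip_depth += 1
            else:
                attrs = "".join(
                    f" {k}={quoteattr(v)}" for k, v in elem.attrib.items()
                )
                out.write(f"<{elem.tag}{attrs}>")
        else:
            if skip_depth:
                skip_depth -= 1
            else:
                if len(elem) == 0:  # leaf: its text is complete at "end"
                    out.write(escape(elem.text or ""))
                out.write(f"</{elem.tag}>")
            elem.clear()  # free the content; an empty shell stays in the parent


sample = io.BytesIO(
    b'<Report><TasksReportNode Name="T"><TableData NumRows="2">'
    b'<TableRow RowCount="0"><TableColumn Name="A">'
    b'<![CDATA[x & y]]></TableColumn></TableRow>'
    b'<TableRow RowCount="1"><TableColumn Name="A">z</TableColumn></TableRow>'
    b'</TableData></TasksReportNode></Report>'
)
result = io.StringIO()
prune_stream(sample, result)
print(result.getvalue())
```

For a real file, source would be the input path or a file opened in binary mode and out a text file opened for writing. Cleared elements still leave empty shells attached to their parents, so memory grows slowly with the number of elements; removing them as well would need the same care about timing as described above.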
