Efficient way of XML parsing in ElementTree (1.3.0), Python


Question

I am trying to parse huge XML files, ranging from 20 MB to 3 GB. The files are samples coming from different instrumentation. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).

Below is a small part of a sample file. A namespace is present in all files. An interesting feature of the files is that they have more node attributes than text.

<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">

    <instrumentConfiguration id="QTOF">
                    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
                    <componentList count="4">
                            <source order="1">
                                    <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
                            </source>
                            <analyzer order="2">
                                    <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
                            </analyzer>
                            <analyzer order="3">
                                    <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
                            </analyzer>
                            <detector order="4">
                                    <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
                            </detector>
                    </componentList>
     </instrumentConfiguration>

A small but complete file is here.

So far, I have been using findall for every element of interest.

import xml.etree.ElementTree as ET

tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of each instrument configuration
    print(ins.attrib["id"])

How can I access all children/grandchildren of the instrumentConfiguration element(s)?

s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')

Example of what I want:

InstrumentConfiguration
-----------------------
Id:QTOF
Parameter1: Q-Tof ultima
source:nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector

Is there an efficient way of parsing element/subelement/subelement when a namespace exists? Or do I have to use find/findall every time to access a particular element in the tree with the namespace? This is just a small example; I have to parse more complex element hierarchies.

Any suggestions?

Edit

I didn't get a correct answer, so I had to edit once more!

Recommended answer

Here's a script that parses one million <instrumentConfiguration/> elements (a 967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.

That works out to a throughput of roughly 24MB/s (967MB in about 40 seconds). The cElementTree page (2005) reports 47MB/s.

#!/usr/bin/env python
from itertools import imap, islice, izip
from operator  import itemgetter
from xml.etree import cElementTree as etree

def parsexml(filename):
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it) # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))] # cvParam
            componentList_count = int(next(it).get('count'))
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2]
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear() # preserve memory

def print_values(it):
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

# `filename` must be set to the path of the mzML file before this call.
print_values(parsexml(filename))

Output

$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps

Note: The code is fragile: it assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
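A less order-dependent variant can be sketched as follows. This is not the answer's script: it is a minimal sketch written for Python 3, where xml.etree.ElementTree picks up the C accelerator automatically, and it waits for the 'end' event of each <instrumentConfiguration/> so that its children can be looked up by name instead of by position. The names iter_instrument_configs, local_name, and SAMPLE, and the "mz" prefix, are hypothetical and chosen only for illustration; the SAMPLE document is the question's fragment with a closing </mzML> added.

```python
import io
import xml.etree.ElementTree as ET

MZML_NS = "http://psi.hupo.org/ms/mzml"
NS = {"mz": MZML_NS}  # "mz" is an arbitrary prefix used only in the paths below
CONFIG_TAG = "{%s}instrumentConfiguration" % MZML_NS


def local_name(tag):
    """Strip the '{namespace}' part from a qualified tag name."""
    return tag.rpartition("}")[2]


def iter_instrument_configs(source):
    """Yield a list of (key, value) pairs per <instrumentConfiguration/>."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != CONFIG_TAG:
            continue
        values = [("Id", elem.get("id"))]
        # Direct <cvParam/> children, wherever they appear among the children.
        for n, param in enumerate(elem.findall("mz:cvParam", NS), start=1):
            values.append(("Parameter%d" % n, param.get("name")))
        # Every component (source, analyzer, detector, ...) and its params.
        for component in elem.findall("mz:componentList/*", NS):
            for param in component.findall("mz:cvParam", NS):
                values.append((local_name(component.tag), param.get("name")))
        yield values
        root.clear()  # drop already-processed elements to limit memory use


SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""

if __name__ == "__main__":
    for config in iter_instrument_configs(io.BytesIO(SAMPLE)):
        for key, value in config:
            print("%s: %s" % (key, value))
```

Passing a namespace map to findall keeps the full namespace URI out of every path, which addresses the question's concern about repeating it. The trade-off is that each configuration is fully built before it is read, which is likely somewhat slower than the answer's start-event approach; no timings were measured for this sketch.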

ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.

If you replace root.clear() with elem.clear(), the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as cElementTree's, but it consumes 20 (root.clear()) / 2 (elem.clear()) times as much memory (500MB).
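One plausible reason for the memory difference is that elem.clear() empties an element but leaves it attached to its parent, so the root keeps accumulating empty children, whereas root.clear() detaches them. The sketch below illustrates only that retention effect on a small synthetic document, in Python 3; it does not reproduce the timings or memory figures above. The helpers make_doc and count_retained are hypothetical names introduced for this illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"


def make_doc(n):
    """Build a synthetic document with n empty <instrumentConfiguration/> elements."""
    items = "".join('<instrumentConfiguration id="c%d"/>' % i for i in range(n))
    doc = '<mzML xmlns="http://psi.hupo.org/ms/mzml">%s</mzML>' % items
    return doc.encode("utf-8")


def count_retained(xml_bytes, clear_root):
    """Return the largest number of children still attached to the root
    right after each clearing step."""
    context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    retained = 0
    for event, elem in context:
        if event == "end" and elem.tag == NS + "instrumentConfiguration":
            if clear_root:
                root.clear()  # detaches every child seen so far
            else:
                elem.clear()  # empties elem, but root still references it
            retained = max(retained, len(root))
    return retained


if __name__ == "__main__":
    doc = make_doc(1000)
    print("root.clear():", count_retained(doc, clear_root=True))
    print("elem.clear():", count_retained(doc, clear_root=False))
```

With elem.clear(), the retained count grows with the number of elements in the file, which is consistent with the higher memory use reported for that variant.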
