Efficient way of XML parsing in ElementTree (1.3.0) Python
Question
I am trying to parse huge XML files ranging from 20MB to 3GB. The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).
Below is a small part of one of my sample files. A namespace is present in all of the files. An interesting feature of the files is that their nodes carry more attributes than text.
<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
Small but complete file is here
So far, I have been using findall for every element of interest.
import xml.etree.ElementTree as ET

tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of each instrument
    print(ins.attrib["id"])
How can I access all children/grandchildren of the instrumentConfiguration element(s)?
s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')
Output that I want:
InstrumentConfiguration
-----------------------
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
Is there an efficient way of parsing element/subelement/subelement when a namespace exists? Or do I have to use find/findall every time to access a particular element in a tree with a namespace? This is just a small example; I have to parse a more complex element hierarchy.
Any suggestions?
Edit
I didn't get a correct answer, so I have had to edit the question once more.
Recommended answer
Here's a script that parses one million <instrumentConfiguration/> elements (967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.
The throughput is about 24MB/s (967MB in 40 seconds). The cElementTree page (2005) reports 47MB/s.
#!/usr/bin/env python
# Python 2 script: itertools.imap/izip and cElementTree do not exist in current Python 3.
from itertools import imap, islice, izip
from operator import itemgetter
from xml.etree import cElementTree as etree

def parsexml(filename):
    # Keep only the element from each (event, element) pair.
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it)  # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))]  # cvParam
            componentList_count = int(next(it).get('count'))
            # Consume the elements pairwise: a component, then its cvParam.
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2]  # strip the namespace
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear()  # preserve memory

def print_values(it):
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

# `filename` is the path to the mzML file; the original answer does not show where it is set.
print_values(parsexml(filename))
Output
$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps
Note: the code is fragile. It assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
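A less position-dependent variant can be sketched as follows. This is not the answer's script: it is written for Python 3, listens for 'end' events so that each <instrumentConfiguration/> is complete before it is read, and looks children up by tag instead of by order or by the count attribute. The names iter_instrument_configs and local_name are hypothetical, and the sketch assumes that <instrumentConfiguration/> is a direct child of the root, as in the sample above. Its speed and memory use have not been measured against the figures quoted here.

```python
import xml.etree.ElementTree as ET

NS = '{http://psi.hupo.org/ms/mzml}'


def local_name(tag):
    """Strip a '{namespace}' prefix from a tag name (hypothetical helper)."""
    return tag.rpartition('}')[2]


def iter_instrument_configs(source):
    """Yield a list of (key, value) pairs per <instrumentConfiguration/>.

    `source` is a filename or a binary file object.
    """
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != 'end' or elem.tag != NS + 'instrumentConfiguration':
            continue
        values = [('Id', elem.get('id'))]
        # Direct <cvParam/> children, wherever they appear among the children.
        for number, param in enumerate(elem.findall(NS + 'cvParam'), 1):
            values.append(('Parameter%d' % number, param.get('name')))
        # Components are found by path, not by position or declared count.
        for component in elem.findall(NS + 'componentList/*'):
            for param in component.findall(NS + 'cvParam'):
                values.append((local_name(component.tag), param.get('name')))
        yield values
        root.clear()  # drop already-processed elements to bound memory
```

Each yielded list can be passed through the same ': '.join(...) formatting that print_values uses in the script above.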
ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.
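That comparison applies to the Python 2 era modules. In Python 3.3 and later, xml.etree.ElementTree uses the C accelerator automatically when it is available, and the separate cElementTree module was removed in Python 3.9. A small import shim, shown here only as a sketch, lets the same code pick the fast implementation on either line:

```python
try:
    # Python 2, and Python 3 before 3.9: explicit C implementation.
    from xml.etree import cElementTree as etree
except ImportError:
    # Python 3.9+: cElementTree is gone; ElementTree is accelerated by default.
    from xml.etree import ElementTree as etree
```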
If you replace root.clear() with elem.clear(), the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as cElementTree's, but it consumes about 500MB of memory, which is 20 times as much as cElementTree with root.clear() and 2 times as much as cElementTree with elem.clear().
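On the question of whether the full namespace has to be spelled out in every find/findall call: in Python 3, find and findall accept a namespaces mapping, so a short prefix can stand in for the URI. The sketch below is an illustration, not part of the script above. The prefix mz and the function describe_instruments are hypothetical names, and the approach assumes the tree is already in memory, so it only suits the smaller files.

```python
import xml.etree.ElementTree as ET

# The prefix 'mz' is an arbitrary local alias; only the URI has to match the file.
NAMESPACES = {'mz': 'http://psi.hupo.org/ms/mzml'}


def describe_instruments(root):
    """Return 'key: value' lines for every <instrumentConfiguration/> under root."""
    lines = []
    for conf in root.findall('.//mz:instrumentConfiguration', NAMESPACES):
        lines.append('Id: %s' % conf.get('id'))
        for param in conf.findall('mz:cvParam', NAMESPACES):
            lines.append('Parameter: %s' % param.get('name'))
        for component in conf.findall('mz:componentList/*', NAMESPACES):
            kind = component.tag.rpartition('}')[2]  # tag without the namespace
            for param in component.findall('mz:cvParam', NAMESPACES):
                lines.append('%s: %s' % (kind, param.get('name')))
    return lines
```

For example, describe_instruments(ET.parse('plgs_example.mzML').getroot()) would walk the file from the question, provided it fits in memory.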