Parse large python xml using xmltree


Problem description


I have a Python script that parses huge XML files (the largest one is 446 MB):

    try:
        parser = etree.XMLParser(encoding='utf-8')
        tree = etree.parse(os.path.join(srcDir, fileName), parser)
        root = tree.getroot()
    except Exception as e:
        print("Error parsing file " + str(fileName) + " Reason " + str(e))

    for child in root:
        if "PersonName" in child.tag:
            personName = child.text

This is what my XML looks like:

<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
  <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
     <Description>myData</Description>
     <Identifier>43hhjh87n4nm</Identifier>
  </Aliases>
  <RollNo uom="kPa">39979172.201167159</RollNo>
  <PersonName>Miracle Smith</PersonName>
  <Date>2017-06-02T01:10:32-05:00</Date>
....

All I want to do is get the contents of the PersonName tag, that's all. I don't care about the other tags.

Sadly, my files are huge and I keep getting this error when I use the code above:

Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486

My XML is perfectly fine and has no extra content. It seems that parsing the large files is what causes the error. I have looked at iterparse(), but it seems too complex for what I want to achieve, since it parses the whole DOM while I just want that one tag under the root. Also, I could not find a good sample that gets the correct value by tag name.

Should I use a regex or a grep/awk approach for this? Or is there a tweak to my code that will let me get the person name from these huge files?

UPDATE: I tried this sample and it seems to print the whole world from the XML except my tag.

Does iterparse read the file from bottom to top? In that case it would take a long time to reach the top, i.e. my PersonName tag. I tried changing the line below to read end to start, events=("end", "start"), and it does the same thing!

path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        print(elem.text)  # prints the whole world
        if elem.tag == 'PersonName':
            print(elem.text)
        path.pop()

Solution

Iterparse is not that difficult to use in this case.

temp.xml is the file presented in your question with a </MyRoot> stuck on as a line at the end.

Think of the source = line as boilerplate, if you will: it parses the XML file and returns chunks of it element by element, indicating whether each chunk is the 'start' or the 'end' of an element and supplying information about that element.
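
To make the 'start' and 'end' chunks concrete, here is a minimal sketch on a tiny stand-in document. The namespace URI and the BytesIO wrapper are illustrative assumptions, not the asker's data. It suggests two things: events arrive in document order, top to bottom, and each tag carries its namespace in braces, which is why a plain comparison with 'PersonName' never matches.

```python
import io
from xml.etree import ElementTree

# A tiny stand-in document (hypothetical) with a default namespace,
# similar in shape to the XML in the question.
doc = (
    b'<MyRoot xmlns="http://www.example.org/ns">'
    b'<PersonName>Miracle Smith</PersonName>'
    b'</MyRoot>'
)

# Collect (event, tag) pairs in the order iterparse reports them.
events = [
    (event, elem.tag)
    for event, elem in ElementTree.iterparse(io.BytesIO(doc), events=("start", "end"))
]

for event, tag in events:
    print(event, tag)
# start {http://www.example.org/ns}MyRoot
# start {http://www.example.org/ns}PersonName
# end {http://www.example.org/ns}PersonName
# end {http://www.example.org/ns}MyRoot
```

Because the tag is really '{http://www.example.org/ns}PersonName', matching on the end of the tag (or on the full qualified name) is what picks it up.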

In this case we need to consider only the 'start' events. We watch for 'PersonName' tags and pick up their text. Having found the one and only such item in the XML file, we abandon the processing.

>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
... 
'Miracle Smith'
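
One caveat: the ElementTree documentation only guarantees that an element's text is populated once its 'end' event has been emitted; at 'start' the text may not have been read yet. A more defensive sketch, assuming the same file layout, waits for 'end' events and stops at the first match. The helper name first_text and the sample data below are made up for illustration.

```python
import os
import tempfile
from xml.etree import ElementTree


def first_text(path, local_name):
    """Return the text of the first element whose local tag name is local_name, or None."""
    with open(path, "rb") as handle:
        for _event, elem in ElementTree.iterparse(handle, events=("end",)):
            # Tags look like '{namespace}PersonName'; compare the part after '}'.
            if elem.tag.rsplit("}", 1)[-1] == local_name:
                return elem.text  # stop reading as soon as the tag is found
            # Discard the content of elements that are already finished.
            elem.clear()
    return None


# Hypothetical sample file, shaped like the XML in the question.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2">
  <RollNo uom="kPa">39979172.201167159</RollNo>
  <PersonName>Miracle Smith</PersonName>
</MyRoot>"""

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(sample)

try:
    print(first_text(tmp.name, "PersonName"))
finally:
    os.remove(tmp.name)
```

Returning from inside the loop means the rest of a large file is never parsed, and clearing finished elements should keep memory use modest for whatever precedes the match.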

Edit, in response to question in a comment:

Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.

>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
...   <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
...     <Description>myData</Description>
...     <Identifier>43hhjh87n4nm</Identifier>
...   </Aliases>
...   <RollNo uom="kPa">39979172.201167159</RollNo>
...   <PersonName>Miracle Smith</PersonName>
...   <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...     
'Miracle Smith'
