Parsing a large .bz2 file (40 GB) with lxml iterparse in Python. Error that does not appear with uncompressed file


Question

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41G, I don't want to decompress the file completely.

So I figured out how to parse portions of the planet.osm file with bz2 and lxml, using the following code:

from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for event, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something with the node)

        ## Do some cleaning
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to elem
        while elem.getprevious() is not None:
            del elem.getparent()[0]

This works perfectly with the Geofabrik extracts. However, when I try to parse planet-latest.osm.bz2 with the same script, I get the error:

xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60

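Here are the things I tried:

  • Checked the md5sum of planet-latest.osm.bz2.
  • Inspected planet-latest.osm at the point where the bz2-based script stops. There is no apparent error there; the attribute is called "num_changes", not "num_change" as the error message indicates.
  • I also did something stupid, but the resulting error puzzled me: I opened planet-latest.osm.bz2 in 'rb' mode [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which returned an error saying that (a very long string) could not be opened. The strange thing is that the (very long string) ends exactly where the "Specification mandate value" error points...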

Then I tried decompressing planet.osm.bz2 first, using a simple

bzcat planet.osm.bz2 > planet.osm

And ran the parser directly on planet.osm. And... it worked! I am very puzzled by this, and could not find any pointer to why this may happen and how to solve this. My guess would be there is something going on between the decompression and the parsing, but I am not sure. Please help me understand!

Recommended answer

It turns out that the problem is with the compressed planet.osm file.

As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 module of the Python standard library (in Python 2, and in Python 3 before 3.3) cannot read multistream files. However, the bz2 documentation points to an alternative module that can read such files, bz2file. I used it and it works perfectly!

So the code should read:

from lxml import etree as et
from bz2file import BZ2File  # drop-in replacement that reads multistream files

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for event, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something with the node)

        ## Do some cleaning
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to elem
        while elem.getprevious() is not None:
            del elem.getparent()[0]
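
To see on a small scale why a reader that only understands a single stream stops partway through a multistream file, the following sketch may help. It is not part of the original answer: it assumes Python 3.3 or later, uses only the standard-library bz2 module, and the two XML fragments are made-up sample data rather than real planet.osm content.

```python
import bz2

# Hypothetical sample data: two halves of a tiny XML document.
part_one = b"<osm><node id='1'/>"
part_two = b"<node id='2'/></osm>"

# A multistream file is simply several complete bz2 streams
# written back to back.
multistream = bz2.compress(part_one) + bz2.compress(part_two)

# BZ2Decompressor handles exactly one stream: it stops at the end of
# the first stream and leaves the remaining bytes untouched.
decoder = bz2.BZ2Decompressor()
first_only = decoder.decompress(multistream)
leftover = decoder.unused_data

print(first_only)         # data from the first stream only
print(decoder.eof)        # the decoder believes the input is finished
print(len(leftover) > 0)  # the second stream was never decoded

# bz2.decompress (Python 3.3+) continues across stream boundaries.
everything = bz2.decompress(multistream)
print(everything == part_one + part_two)
```

A parser fed only the output of the first stream sees a document that is cut off in the middle, which is consistent with a syntax error being reported at a seemingly valid position in the file.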

Also, doing some research on using the PBF format (as advised in the comments), I stumbled upon imposm.parser, a python module that implements a generic parser for OSM data (in pbf or xml format). You may want to have a look at this!
