Pickle file too large to load

Problem description

The problem I am having is that I have a very large pickle file (2.6 GB) that I am trying to open, but each time I do so I get a memory error. I realize now that I should have used a database to store all the information, but it's too late now. The pickle file contains dates and text from the U.S. Congressional Record that was crawled from the internet (the crawl took about 2 weeks to run). Is there any way I can access the information that I dumped into the pickle file incrementally, or a way to convert the pickle file into a SQL database or something else that I can open without having to re-input all the data? I really don't want to spend another 2 weeks re-crawling the Congressional Record and inputting the data into a database.

Thanks a lot for your help.

EDIT

Code for how the objects get pickled:

def save_objects(objects): 
    with open('objects.pkl', 'wb') as output: 
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():   
    Links()
    file = open("datafile.txt", "w")
    objects=[]
    with open('links2.txt', 'rb') as infile:
        for link in infile: 
            print link
            title,text,date=Get_full_text(link)
            article=Doccument(title,date,text)
            if text != None:
                write_to_text(date,text)
                objects.append(article)
                save_objects(objects)

This is the program that produces the error:

def Main():
    file= open('objects1.pkl', 'rb') 
    object = pickle.load(file)

Recommended answer

Looks like you're in a bit of a pickle! ;-) Hopefully after this, you'll NEVER USE PICKLE EVER. It's just not a very good data storage format.

Anyway, for this answer I'm assuming your Document class looks a bit like this. If not, comment with your actual Document class:

class Document(object): # <-- object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text): # assuming all strings
        self.title = title
        self.date = date
        self.text = text

I made some simple test data with this class:

d = [Document(title='foo', text='foo is good', date='1/1/1'), Document(title='bar', text='bar is better', date='2/2/2'), Document(title='baz', text='no one likes baz :(', date='3/3/3')]

I pickled it with protocol 2 (pickle.HIGHEST_PROTOCOL in Python 2.x):

>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

And disassembled it with pickletools:

>>> pickletools.dis(s)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: c        GLOBAL     '__main__ Document'
   25: q        BINPUT     1
   27: )        EMPTY_TUPLE
   28: \x81     NEWOBJ
   29: q        BINPUT     2
   31: }        EMPTY_DICT
   32: q        BINPUT     3
   34: (        MARK
   35: U            SHORT_BINSTRING 'date'
   41: q            BINPUT     4
   43: U            SHORT_BINSTRING '1/1/1'
   50: q            BINPUT     5
   52: U            SHORT_BINSTRING 'text'
   58: q            BINPUT     6
   60: U            SHORT_BINSTRING 'foo is good'
   73: q            BINPUT     7
   75: U            SHORT_BINSTRING 'title'
   82: q            BINPUT     8
   84: U            SHORT_BINSTRING 'foo'
   89: q            BINPUT     9
   91: u            SETITEMS   (MARK at 34)
   92: b        BUILD
   93: h        BINGET     1
   95: )        EMPTY_TUPLE
   96: \x81     NEWOBJ
   97: q        BINPUT     10
   99: }        EMPTY_DICT
  100: q        BINPUT     11
  102: (        MARK
  103: h            BINGET     4
  105: U            SHORT_BINSTRING '2/2/2'
  112: q            BINPUT     12
  114: h            BINGET     6
  116: U            SHORT_BINSTRING 'bar is better'
  131: q            BINPUT     13
  133: h            BINGET     8
  135: U            SHORT_BINSTRING 'bar'
  140: q            BINPUT     14
  142: u            SETITEMS   (MARK at 102)
  143: b        BUILD
  144: h        BINGET     1
  146: )        EMPTY_TUPLE
  147: \x81     NEWOBJ
  148: q        BINPUT     15
  150: }        EMPTY_DICT
  151: q        BINPUT     16
  153: (        MARK
  154: h            BINGET     4
  156: U            SHORT_BINSTRING '3/3/3'
  163: q            BINPUT     17
  165: h            BINGET     6
  167: U            SHORT_BINSTRING 'no one likes baz :('
  188: q            BINPUT     18
  190: h            BINGET     8
  192: U            SHORT_BINSTRING 'baz'
  197: q            BINPUT     19
  199: u            SETITEMS   (MARK at 153)
  200: b        BUILD
  201: e        APPENDS    (MARK at 5)
  202: .    STOP

Looks complex! But really, it's not so bad. pickle is basically a stack machine: each ALL_CAPS identifier you see is an opcode that manipulates the internal "stack" in some way during decoding. If we were trying to parse some complex structure, this would matter more, but luckily we're just dealing with a simple list of objects that are essentially tuples of strings. All this "code" does is construct a bunch of objects on the stack and then append everything above the mark to a list.
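To watch the stack machine's opcodes go by without loading anything, one option is pickletools.genops, which yields opcodes lazily and does not import or instantiate the pickled classes. A minimal sketch, assuming Python 3 (the answer itself targets Python 2) and a tiny stand-in pickle rather than the real file:

```python
import io
import pickle
import pickletools

# A small stand-in pickle: a list of two strings, written with protocol 2.
data = pickle.dumps(['foo', 'bar'], 2)

# genops accepts a bytes object or a file-like object and reads one opcode at a time.
names = [opcode.name for opcode, arg, pos in pickletools.genops(io.BytesIO(data))]
print(names)
```

Because genops only needs a file-like object, the same loop could in principle walk a multi-gigabyte file without building the pickled list in memory.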

The one thing we DO need to care about is the BINPUT / BINGET opcodes you see scattered around. These are for memoization: to reduce the data footprint, pickle saves a string with BINPUT <id>, and if the same string comes up again, it emits a BINGET <id> to retrieve it from the memo instead of dumping it again.
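The memoization can be seen on a toy example. This is a sketch assuming Python 3, where protocol 2 writes str values as BINUNICODE rather than the SHORT_BINSTRING shown in the Python 2 disassembly; the BINPUT / BINGET mechanism is the same:

```python
import pickle
import pickletools

key = 'date'
# The same string object appears twice, so its second occurrence
# can be written as a memo lookup instead of a second copy.
data = pickle.dumps([key, key], 2)

ops = [(opcode.name, arg) for opcode, arg, pos in pickletools.genops(data)]
for name, arg in ops:
    print(name, arg)
```

The string payload appears once, followed by a BINPUT that stores it and a BINGET that fetches it back, which is why a streaming reader has to keep its own memo table.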

Also, another complication! There's more than just SHORT_BINSTRING: there's the regular BINSTRING for strings longer than 255 bytes, and some fun unicode variants as well. I'll just assume that you're using Python 2 with all-ASCII strings. Again, comment if this isn't a correct assumption.
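The two string opcodes differ only in their length prefix: SHORT_BINSTRING is followed by a 1-byte unsigned length, BINSTRING by a 4-byte little-endian signed length. A small sketch of decoding both, assuming Python 3 bytes streams; read_string is a hypothetical helper written for this illustration, not part of the pickle module:

```python
import io
import struct

SHORT_BINSTRING = b'U'  # followed by a 1-byte unsigned length
BINSTRING = b'T'        # followed by a 4-byte little-endian signed length

def read_string(f):
    """Read one SHORT_BINSTRING or BINSTRING payload from a binary stream."""
    opcode = f.read(1)
    if opcode == SHORT_BINSTRING:
        (length,) = struct.unpack('<B', f.read(1))
    elif opcode == BINSTRING:
        (length,) = struct.unpack('<i', f.read(4))
    else:
        raise ValueError('not a string opcode: %r' % opcode)
    return f.read(length)

short = io.BytesIO(b'U\x04date')
long_ = io.BytesIO(b'T\x14\x05\x00\x00' + b'bar is better' * 100)
print(read_string(short))
print(len(read_string(long_)))
```

Reading the 1-byte length as unsigned matters: a signed read would turn any length from 128 to 255 into a negative number.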

OK, so we need to stream the file until we hit a '\x81' byte (NEWOBJ). Then we scan forward until we hit a '(' (MARK) character. From there, until we hit a 'u' (SETITEMS), we read pairs of key/value strings. There should be 3 pairs in total, one for each field.

So, let's do this. Here's my script to read the pickle data in a streaming fashion. It's far from perfect, since I just hacked it together for this answer, and you'll need to modify it a lot, but it's a good start.

pickledata = '\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06T\x14\x05\x00\x00bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)

import pickle # just for opcode names
import struct # binary unpacking

def try_memo(f, v, cache):
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1) # rewind

def try_read_string(f, opcode, cache):
    if opcode in [ pickle.SHORT_BINSTRING, pickle.BINSTRING ]:
        # SHORT_BINSTRING: 1-byte unsigned length; BINSTRING: 4-byte little-endian signed length
        length_type = 'B' if opcode == pickle.SHORT_BINSTRING else '<i'
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + repr(f.read(4)))
    else:
        raise Exception('Invalid opcode ' + repr(opcode) + ' at pos ' + str(f.tell()))

memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        while picklefile.read(1) != pickle.MARK:
            pass # scan forward to field instantiation
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields
        # insert into SQLite here
    elif c == pickle.STOP or c == '':
        break # STOP opcode, or end of file on a truncated pickle

This correctly reads my test data in pickle format 2 (modified to have a long string):

$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
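For the "insert into SQLite" step, here is one possible sketch in Python 3 (an assumption; the script above is Python 2). It swaps the hand-rolled byte scanning for pickletools.genops, which parses every opcode's argument itself, and writes each record into a SQLite table. The names iter_documents, load_into_sqlite, max_memo_len and the documents table are hypothetical, and the sketch keeps the answer's assumption that every field is a plain string:

```python
import io
import pickletools
import sqlite3

STRING_OPS = {'SHORT_BINSTRING', 'BINSTRING', 'BINUNICODE'}

def iter_documents(fileobj, max_memo_len=64):
    """Yield one {field: value} dict per pickled object, reading opcodes lazily."""
    memo = {}       # memo id -> short string (field names and other small values)
    last = None     # string pushed by the previous opcode, if any
    pending = None  # strings seen since the last NEWOBJ, or None outside an object
    for opcode, arg, pos in pickletools.genops(fileobj):
        name = opcode.name
        if name in STRING_OPS:
            last = arg
            if pending is not None:
                pending.append(arg)
            continue
        if name in ('BINPUT', 'LONG_BINPUT'):
            # Assumption: long article texts are never referenced again,
            # so only short strings are kept to bound the memo's size.
            if last is not None and len(last) <= max_memo_len:
                memo[arg] = last
        elif name in ('BINGET', 'LONG_BINGET'):
            if pending is not None:
                pending.append(memo[arg])
        elif name == 'NEWOBJ':
            pending = []
        elif name == 'SETITEMS' and pending is not None:
            # Strings alternate key, value, key, value, ...
            yield dict(zip(pending[0::2], pending[1::2]))
            pending = None
        last = None

def load_into_sqlite(fileobj, conn):
    """Stream records from the pickle into a hypothetical documents table."""
    conn.execute('CREATE TABLE IF NOT EXISTS documents (title TEXT, date TEXT, text TEXT)')
    with conn:  # commit once at the end
        conn.executemany(
            'INSERT INTO documents (title, date, text) VALUES (:title, :date, :text)',
            iter_documents(fileobj),
        )

# The three-document test pickle from above, as a Python 3 bytes literal.
pickledata = (
    b'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03('
    b'U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07'
    b'U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b('
    b'h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eub'
    b'h\x01)\x81q\x0f}q\x10('
    b'h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ub'
    b'e.'
)

conn = sqlite3.connect(':memory:')
load_into_sqlite(io.BytesIO(pickledata), conn)
print(conn.execute('SELECT title, date FROM documents ORDER BY rowid').fetchall())
```

For the real file you would pass open('objects.pkl', 'rb') and a file-backed database instead of the in-memory ones. If a field can be None or a memo lookup fails with KeyError, the layout differs from the one assumed here and the function needs adjusting.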

Good luck!
