Persistent memoization in Python


Question


I have an expensive function that takes and returns a small amount of data (a few integers and floats). I have already memoized this function, but I would like to make the memo persistent. There are already a couple of threads relating to this, but I'm unsure about potential issues with some of the suggested approaches, and I have some fairly specific requirements:



  • I will definitely use the function from multiple threads and processes simultaneously (both via multiprocessing and from separate Python scripts)
  • I will not need read or write access to the memo from outside this Python function
  • I am not that concerned about the memo being corrupted on rare occasions (like pulling the plug, or accidentally writing to the file without locking it), as it isn't that expensive to rebuild (typically 10-20 minutes), but I would prefer that it not be corrupted because of exceptions or because a Python process is terminated manually (I don't know how realistic that is)
  • I would strongly prefer solutions that don't require large external libraries, as I have a severely limited amount of hard disk space on one machine I will be running the code on
  • I have a weak preference for cross-platform code, but I will likely only use this on Linux
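For context, the kind of in-memory memoization already in place might look like this minimal sketch (`memoize` and `expensive` are stand-in names, not from the question):

```python
import functools

def memoize(func):
    """Cache results in a plain dict keyed by the positional arguments."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)  # only computed on the first call
        return cache[args]
    return wrapper

@memoize
def expensive(x, y):
    # stand-in for the real 10-20 minute computation
    return x ** y + 0.5

print(expensive(2, 10))  # -> 1024.5 (computed)
print(expensive(2, 10))  # -> 1024.5 (served from the in-memory cache)
```

The question is how to make `cache` survive across processes and interpreter restarts.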


This thread discusses the shelve module, which is apparently not process-safe. Two of the answers suggest using fcntl.flock to lock the shelve file. Some of the responses in this thread, however, seem to suggest that this is fraught with problems - but I'm not exactly sure what they are. It sounds as though this is limited to Unix (though apparently Windows has an equivalent called msvcrt.locking), and the lock is only 'advisory' - i.e., it won't stop me from accidentally writing to the file without checking it is locked. Are there any other potential problems? Would writing to a copy of the file, and replacing the master copy as a final step, reduce the risk of corruption?
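To make the question concrete, a write path combining `fcntl.flock` with the write-to-a-copy-then-replace idea might look like the sketch below (Unix-only; `save_memo`, `load_memo`, and the file paths are hypothetical names, not an established recipe). `os.replace` is atomic on POSIX, so a reader sees either the old file or the new one, never a torn write:

```python
import fcntl
import os
import pickle
import tempfile

MEMO_PATH = 'memo.pickle'        # hypothetical location of the memo file
LOCK_PATH = MEMO_PATH + '.lock'  # separate lock file, never replaced

def save_memo(memo):
    """Serialize `memo` under an advisory lock, then atomically swap it in."""
    with open(LOCK_PATH, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            # Write the new copy next to the master so the rename stays
            # on one filesystem (a cross-device rename is not atomic)
            fd, tmp_path = tempfile.mkstemp(
                dir=os.path.dirname(os.path.abspath(MEMO_PATH)))
            with os.fdopen(fd, 'wb') as tmp:
                pickle.dump(memo, tmp)
            # Atomic on POSIX: readers never observe a half-written file
            os.replace(tmp_path, MEMO_PATH)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

def load_memo():
    """Read the current memo; returns an empty dict if none exists yet."""
    try:
        with open(MEMO_PATH, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}
```

Even with the atomic replace, the lock itself remains advisory: any code that writes `MEMO_PATH` without first taking the lock bypasses it entirely.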


It doesn't look as though the dbm module will do any better than shelve. I've had a quick look at sqlite3, but it seems a bit overkill for this purpose. This thread and this one mention several 3rd party libraries, including ZODB, but there are a lot of choices, and they all seem overly large and complicated for this task.

Does anyone have any suggestions?


UPDATE: kindall mentioned IncPy below, which does look very interesting. Unfortunately, I wouldn't want to move back to Python 2.6 (I'm actually using 3.2), and it looks like it is a bit awkward to use with C libraries (I make heavy use of numpy and scipy, among others).


kindall's other idea is instructive, but I think adapting this to multiple processes would be a little difficult - I suppose it would be easiest to replace the queue with file locking or a database.


Looking at ZODB again, it does look perfect for the task, but I really do want to avoid using any additional libraries. I'm still not entirely sure what all the issues with simply using flock are - I imagine one big problem is if a process is terminated while writing to the file, or before releasing the lock?


So, I've taken synthesizerpatel's advice and gone with sqlite3. If anyone's interested, I decided to make a drop-in replacement for dict that stores its entries as pickles in a database (I don't bother to keep any in memory, as database access and pickling are fast enough compared to everything else I'm doing). I'm sure there are more efficient ways of doing this (and I've no idea whether I might still have concurrency issues), but here is the code:

from collections.abc import MutableMapping  # plain 'collections' on Python < 3.3
import sqlite3
import pickle


class PersistentDict(MutableMapping):
    def __init__(self, dbpath, iterable=None, **kwargs):
        self.dbpath = dbpath
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'create table if not exists memo '
                '(key blob primary key not null, value blob not null)'
            )
        if iterable is not None:
            self.update(iterable)
        self.update(kwargs)

    def encode(self, obj):
        return pickle.dumps(obj)

    def decode(self, blob):
        return pickle.loads(blob)

    def get_connection(self):
        return sqlite3.connect(self.dbpath)

    def __getitem__(self, key):
        key = self.encode(key)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select value from memo where key=?',
                (key,)
            )
            value = cursor.fetchone()
        if value is None:
            raise KeyError(key)
        return self.decode(value[0])

    def __setitem__(self, key, value):
        key = self.encode(key)
        value = self.encode(value)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'insert or replace into memo values (?, ?)',
                (key, value)
            )

    def __delitem__(self, key):
        key = self.encode(key)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select count(*) from memo where key=?',
                (key,)
            )
            if cursor.fetchone()[0] == 0:
                raise KeyError(key)
            cursor.execute(
                'delete from memo where key=?',
                (key,)
            )

    def __iter__(self):
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select key from memo'
            )
            records = cursor.fetchall()
        for r in records:
            yield self.decode(r[0])

    def __len__(self):
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select count(*) from memo'
            )
            return cursor.fetchone()[0]


Answer


sqlite3 out of the box provides ACID. File locking is prone to race-conditions and concurrency problems that you won't have using sqlite3.


Basically, yeah, sqlite3 is more than what you need, but it's not a huge burden. It can run on mobile phones, so it's not like you're committing to running some beastly software. It's going to save you time reinventing wheels and debugging locking issues.
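To illustrate the point, here is a small sketch (file path is arbitrary) in which two independent connections, standing in for two processes, write the same key. sqlite3's own locking serializes the writes, with no explicit lock handling in user code:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'memo.sqlite')

# Two separate connections, as two cooperating processes would have
a = sqlite3.connect(path)
b = sqlite3.connect(path)

a.execute('create table if not exists memo '
          '(key blob primary key not null, value blob not null)')
a.commit()

# sqlite's file locking serializes these commits; 'insert or replace'
# makes concurrent updates of the same key well-defined
a.execute('insert or replace into memo values (?, ?)', (b'key', b'first'))
a.commit()
b.execute('insert or replace into memo values (?, ?)', (b'key', b'second'))
b.commit()

row = a.execute('select value from memo where key=?', (b'key',)).fetchone()
print(row[0])  # -> b'second': the later commit won, nothing was corrupted
```

The worst case under genuine write contention is an `sqlite3.OperationalError: database is locked` timeout rather than a corrupted file, which matches the requirements above.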
