When to commit data in ZODB


Question

I am trying to handle the data generated by the following piece of code:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = score_operation(Gvalue, Hvalue)  # placeholder scoring function
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])

Since the dictionary is large (10000 keys, each mapping to a list of 10000 three-element lists), it is difficult to keep it in memory. I was looking for a solution that stores each key:value pair (the value being a list) as soon as it is generated. It was advised here, Writing and reading a dictionary in specific format (Python), to use ZODB in combination with a BTree.

Bear with me if this is too naive. My question is: when should one call transaction.commit() to commit the data? If I call it at the end of the inner loop, the resulting file is extremely large (I am not sure why). Here is a snippet:

from ZODB import DB
from ZODB.FileStorage import FileStorage
from BTrees.IOBTree import IOBTree
from persistent.list import PersistentList
import transaction

storage = FileStorage('Data.fs')
db = DB(storage)                 # was DB(store): an undefined name
connection = db.open()
root = connection.root()
btree_container = IOBTree()      # instantiate the tree, not the class
root[0] = btree_container
for nodes in G.nodes():
    btree_container[nodes] = PersistentList()  # I was losing data prior to doing this

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = score_operation(Gvalue, Hvalue)  # placeholder scoring function
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
        transaction.commit()

What if I call it outside both the loops? Something like:

......
   ......
      score = score_operation(Gvalue, Hvalue)  # placeholder scoring function
      btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
transaction.commit()

Will all the data be held in memory until I call transaction.commit()? Again, I am not sure why, but this results in a smaller file size on disk.

I want to minimize the data being held in memory. Any guidance would be appreciated!

Answer

Your goal is to make your process manageable within memory constraints. To be able to do this with the ZODB as a tool you need to understand how ZODB transactions work, and how to use them.

First of all you need to understand what a transaction commit does here, which also explains why your Data.fs is getting so large.

The ZODB writes data out per transaction, where any persistent object that has changed gets written to disk. The important detail here is 'persistent object that has changed'; the ZODB works in units of persistent objects.

Not every Python value is a persistent object. If I define a straight-up Python class, it will not be persistent, nor are any of the built-in Python types such as int or list. On the other hand, any class you define that inherits from persistent.Persistent is a persistent object. The BTrees set of classes, as well as the PersistentList class you use in your code, do inherit from Persistent.
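For illustration, here is a minimal sketch of that distinction (the Score class and its attributes are hypothetical, not part of the question's code):

import persistent

class Score(persistent.Persistent):
    # A persistent object: the ZODB tracks changes to its attributes
    # and writes them out at commit time.
    def __init__(self, node, value):
        self.node = node
        self.value = value

plain = [1, 2, 3]          # a built-in list: not a persistent object
record = Score('n1', 0.5)  # changes to record are detected at commit time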

Now, on a transaction commit, any persistent object that has changed is written to disk as part of that transaction. So any PersistentList object that has been appended to will be written in its entirety to disk. BTrees handle this a little more efficiently; they store Buckets, themselves persistent, which in turn hold the actually stored objects. So for every few new nodes you create, a Bucket is written to the transaction, not the whole BTree structure. Note that because the items held in the tree are themselves persistent objects, only references to them are stored in the Bucket records.
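This is also why the 'I was losing data' comment in the question matters: appending to a plain list stored in the tree mutates an object the ZODB does not watch, so the appended data is not written out. A minimal sketch of the difference, reusing the names from the question's snippet:

from persistent.list import PersistentList

# With a plain list as the stored value, the ZODB does not see the append;
# the appended data is lost unless the bucket is explicitly marked changed:
btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])

# With a PersistentList, the list itself registers as changed and is
# written out at the next savepoint or commit:
btree_container.setdefault(Gnodes, PersistentList()).append([Hnodes, score, -1])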

Now, the ZODB writes transaction data by appending it to the Data.fs file, and it does not remove old data automatically. It can construct the current state of the database by finding the most recent version of a given object in the store. This is why your Data.fs is growing so much: you are writing out new versions of larger and larger PersistentList instances as transactions are committed.

Removing the old data is called packing, which is similar to the VACUUM command in PostgreSQL and other relational databases. Simply call the .pack() method on the db variable to remove all old revisions, or use the t and days parameters of that method to limit how much history is retained: t is a time.time() timestamp (seconds since the epoch) before which data may be packed away, and days is the number of days of history to retain before the current time, or before t if specified. Packing should reduce your data file considerably, as the partial lists in older transactions are removed. Do note that packing is an expensive operation and thus can take a while, depending on the size of your dataset.
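As a minimal sketch, using the db variable from the question's snippet, packing away everything but the last day of history could look like this:

import time

db.pack(days=1)                        # keep only the most recent day of history

# equivalent, with an explicit cut-off timestamp one day in the past:
db.pack(t=time.time() - 24 * 60 * 60)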

You are trying to build a very large dataset by using persistence to work around memory constraints, and are using transactions to try and flush things to disk. Normally, however, a transaction commit signals that you have completed constructing your dataset, something you can use as one atomic whole.

What you need to use here is a savepoint. Savepoints are essentially subtransactions: points during the whole transaction where you can ask for data to be temporarily stored on disk. They are made permanent when you commit the transaction. To create a savepoint, call the .savepoint() method on the transaction module:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = score_operation(Gvalue, Hvalue)  # placeholder scoring function
        btree_container.setdefault(Gnodes, PersistentList()).append(
            [Hnodes, score, -1])
    transaction.savepoint(True)
transaction.commit()

In the above example I set the optimistic flag to True, meaning: I do not intend to roll back to this savepoint. Some storages do not support rolling back, and signalling that you do not need this makes your code work in such situations.

Also note that the transaction.commit() happens when the whole data set has been processed, which is what a commit is supposed to achieve.

One thing a savepoint does is trigger garbage collection of the ZODB caches, which means that any data not currently in use is removed from memory.
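The same cache reduction can also be requested explicitly on the connection object from the question's snippet, should you want to force it between savepoints; a minimal sketch:

connection.cacheGC()        # evict objects beyond the configured cache size
connection.cacheMinimize()  # evict every object not actively in use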

Note the 'not currently in use' part there; if any of your code holds on to large values in a variable, that data cannot be cleared from memory. As far as I can determine from the code you've shown us, this looks fine. But I do not know how your operations work or how you generate the nodes; be careful to avoid building complete lists in memory when an iterator will do, or building large dictionaries in which all your lists of lists are referenced, for example.

You can experiment a little as to where you create your savepoints; you could create one every time you've processed one HNodes, or only when done with a GNodes loop like I've done above. You are constructing a list per GNodes, so it would be kept in memory while looping over all the H.nodes() anyway, and flushing to disk would probably only make sense once you've completed constructing it in full.

If, however, you find that you need to clear memory more often, you should consider using either a BTrees.OOBTree.TreeSet class or a BTrees.IOBTree.BTree class instead of a PersistentList, to break up your data into more, smaller persistent objects. A TreeSet is ordered but not (easily) indexable, while a BTree can be used as a list by using simple incrementing index keys:

for i, Hnodes in enumerate(H.nodes()):
    ...
    btree_container.setdefault(Gnodes, IOBTree())[i] = [Hnodes, score, -1]
    if i % 100 == 0:
        transaction.savepoint(True)

The above code uses a BTree instead of a PersistentList and creates a savepoint for every 100 HNodes processed. Because the BTree uses buckets, which are persistent objects in themselves, the whole structure can be flushed to a savepoint more easily, without everything having to stay in memory while all of H.nodes() is processed.
