Writing data to LMDB with Python very slow

Problem description

Creating datasets for training with Caffe, I tried both HDF5 and LMDB. However, creating an LMDB is very slow, even slower than HDF5. I am trying to write ~20,000 images.

Am I doing something terribly wrong? Is there something I am not aware of?

This is my code for LMDB creation:

import lmdb
import caffe

# path, num_data, data and labels are defined elsewhere in the script.
DB_KEY_FORMAT = "{:0>10d}"
db = lmdb.open(path, map_size=int(1e12))
curr_idx = 0
commit_size = 1000
for curr_commit_idx in range(0, num_data, commit_size):
    # one write transaction per batch of 1,000 images
    with db.begin(write=True) as in_txn:
        for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
            d, l = data[i], labels[i]
            im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
            key = DB_KEY_FORMAT.format(curr_idx)
            in_txn.put(key.encode("ascii"), im_dat.SerializeToString())  # keys must be bytes
            curr_idx += 1
db.close()

As you can see, I am creating a transaction for every 1,000 images, because I thought creating a transaction for each image would add overhead, but it seems this doesn't influence performance much.
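
To check what batching actually buys on a given setup, a micro-benchmark along these lines compares one transaction per image against one per 1,000. This is a minimal sketch under stated assumptions: the 4 KB dummy payload stands in for a serialized Datum, and the temporary directory is a placeholder, neither of which appears in the original post.

import time
import tempfile
import lmdb

payload = b"x" * 4096  # dummy value standing in for a serialized Datum

def bench(batch_size, n=10000):
    # Time n puts, committing one transaction per batch_size records.
    with tempfile.TemporaryDirectory() as path:
        db = lmdb.open(path, map_size=int(1e9))
        start = time.time()
        for first in range(0, n, batch_size):
            with db.begin(write=True) as txn:
                for i in range(first, min(first + batch_size, n)):
                    txn.put(b"%010d" % i, payload)
        elapsed = time.time() - start
        db.close()
    return elapsed

print("1 put per txn:     %.2f s" % bench(1))
print("1000 puts per txn: %.2f s" % bench(1000))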

Recommended answer

In my experience, writes to LMDB from Python writing Caffe data have taken 50-100 ms on an ext4 hard disk on Ubuntu. That's why I use tmpfs (RAM-disk functionality built into Linux) and get these writes done in around 0.07 ms. You can make smaller databases on your RAM disk, copy them to a hard disk, and later train on all of them. I make them around 20-40 GB each, as I have 64 GB of RAM.
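
On most Linux systems, /dev/shm is already mounted as tmpfs, so you can try the RAM-disk approach without any extra setup. A minimal sketch, assuming a Linux box with /dev/shm available; the subdirectory name and map size are arbitrary choices, not from the original answer:

import os
import lmdb

# /dev/shm is a tmpfs mount on most Linux distributions, so anything
# written here lives in RAM and never touches the hard disk.
ram_path = "/dev/shm/train_images_lmdb"
os.makedirs(ram_path, exist_ok=True)

# Keep map_size well below available RAM: tmpfs pages consume memory.
db = lmdb.open(ram_path, map_size=20 * 2**30)  # e.g. 20 GB on a 64 GB machine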

Here are some pieces of code to help you dynamically create, fill, and move LMDBs to storage. Feel free to edit them to fit your case. They should save you some time getting your head around how LMDB and file manipulation work in Python.

import os
import shutil
import string
import random

import lmdb


def move_db():
    # Close the current LMDB on the RAM disk, move it to persistent
    # storage under a random 5-character name, then reopen a fresh one.
    global image_db
    image_db.close()
    rnd = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5))
    shutil.move(fold + 'ram/train_images', '/storage/lmdb/' + rnd)
    open_db()


def open_db():
    # fold is the working-directory prefix, defined elsewhere in the script.
    global image_db
    image_db = lmdb.open(os.path.join(fold, 'ram/train_images'),
                         map_async=True,
                         max_dbs=0)


def write_to_lmdb(db, key, value):
    """
    Write (key, value) to db, doubling map_size whenever the map fills up.
    """
    success = False
    while not success:
        txn = db.begin(write=True)
        try:
            txn.put(key, value)
            txn.commit()
            success = True
        except lmdb.MapFullError:
            txn.abort()
            # double the map_size
            curr_limit = db.info()['map_size']
            new_limit = curr_limit * 2
            print('>>> Doubling LMDB map size to %sMB ...' % (new_limit >> 20,))
            db.set_mapsize(new_limit)  # double it

...

image_datum = caffe.io.array_to_datum(transformed_image, label)
write_to_lmdb(image_db, str(itr).encode('ascii'), image_datum.SerializeToString())
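
One way to wire these helpers together is to rotate the RAM-disk database out to storage every so many images. A hedged sketch: samples and the 50,000-image rotation interval are illustrative assumptions, not part of the original answer.

open_db()
for itr, (transformed_image, label) in enumerate(samples):
    image_datum = caffe.io.array_to_datum(transformed_image, label)
    write_to_lmdb(image_db, str(itr).encode('ascii'),
                  image_datum.SerializeToString())
    if (itr + 1) % 50000 == 0:
        move_db()   # ship the filled LMDB off the RAM disk, reopen a fresh one
move_db()           # move the final, partially filled database as well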
