pickle faster than cPickle with numeric data?


Question

I'm currently working on image retrieval with Python. In this example, the keypoints and descriptors extracted from an image are represented as numpy.array objects: the first has shape (2000, 5) and the second has shape (2000, 128). Both contain only values of dtype=numpy.float32.

I was wondering which format to use to save the extracted keypoints and descriptors. I always save two files, one for the keypoints and one for the descriptors, and this counts as one step in my measurements. I compared pickle, cPickle (both with protocol 0 and 2) and NumPy's binary format .npy, and the results really confuse me.

I always thought cPickle was supposed to be faster than the pickle module. But the load time with protocol 0 in particular really sticks out in the results. Does anyone have an explanation for this? Is it because I'm only using numeric data? Seems strange...

PS: In my code I basically loop 1000 times (number=1000) over each technique and average the measured times at the end:

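    import time
    import pickle
    import cPickle  # Python 2 only
    import numpy

    # kp, descr, number, results and the *_path variables are defined elsewhere
    timer = time.time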

    print 'npy save...'
    t0 = timer()
    for i in range(number):
        numpy.save(npy_kp_path, kp)
        numpy.save(npy_descr_path, descr)
    t1 = timer()
    results['npy']['save'] = t1 - t0

    print 'npy load...'
    t0 = timer()
    for i in range(number):
        kp = numpy.load(npy_kp_path)
        descr = numpy.load(npy_descr_path)
    t1 = timer()
    results['npy']['load'] = t1 - t0


    print 'pickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=0)
        with open(pkl0_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['pkl0']['save'] = t1 - t0

    print 'pickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl0_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl0']['load'] = t1 - t0


    print 'cPickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=0)
        with open(cpkl0_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['cpkl0']['save'] = t1 - t0

    print 'cPickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl0_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl0']['load'] = t1 - t0


    print 'pickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
        with open(pkl2_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['pkl2']['save'] = t1 - t0

    print 'pickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl2_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl2']['load'] = t1 - t0


    print 'cPickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
        with open(cpkl2_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['cpkl2']['save'] = t1 - t0

    print 'cPickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl2_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl2']['load'] = t1 - t0 
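
The blocks above all repeat the same pattern, so the measurement could be factored into one small helper. The following is only a sketch written for Python 3: `average_time` and `example_workload` are hypothetical names that do not appear in the original script, and the helper divides by `number` to report a per-call average, whereas the code above stores the total.

```python
import timeit


def average_time(func, number=1000):
    """Call func() `number` times and return the mean wall-clock seconds per call."""
    timer = timeit.default_timer  # timeit's default wall-clock timer
    t0 = timer()
    for _ in range(number):
        func()
    t1 = timer()
    return (t1 - t0) / number


def example_workload():
    # Stand-in for one save or load step (hypothetical; replace with real I/O).
    return sum(range(1000))


mean_seconds = average_time(example_workload, number=100)
print("mean seconds per call: %.6f" % mean_seconds)
```

Each save or load step from the question could then be wrapped in a function and passed to the helper, which keeps the timing logic in one place.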

Recommended answer

The numeric data of an ndarray (more precisely, its binary representation) is pickled as one long string. It appears that cPickle is indeed much slower than pickle at unpickling large strings from protocol 0 files. Why? My guess is that pickle makes use of well-tuned string algorithms from the standard library, while cPickle has fallen behind.
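
One way to see the "one long string" for yourself is to inspect what an array hands to the pickler. This is a minimal sketch, assuming Python 3 with a recent NumPy (where the buffer is a `bytes` object rather than the Python 2 `str` the answer refers to); the all-zero `descr` array is a stand-in for real descriptors.

```python
import pickle

import numpy as np

# Same shape and dtype as the descriptor array in the question (stand-in data).
descr = np.zeros((2000, 128), dtype=np.float32)

# __reduce__ returns (callable, args, state); for a non-object array the last
# element of the state tuple is the raw buffer as a single byte string.
state = descr.__reduce__()[2]
raw = state[-1]
print(type(raw).__name__, len(raw), descr.nbytes)

# Round trip through protocol 0 and protocol 2 to confirm both preserve the data.
for protocol in (0, 2):
    payload = pickle.dumps(descr, protocol=protocol)
    restored = pickle.loads(payload)
    assert np.array_equal(restored, descr)
    print("protocol", protocol, "payload bytes:", len(payload))
```

The length of `raw` equals `descr.nbytes`, so the unpickler has to read the whole array back as a single string-like object, which is the case the answer identifies as slow.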

The observation above comes from experimenting with Python 2.7. Python 3.3, which uses a C extension automatically, is faster than either module on Python 2.7, so apparently the issue has been fixed.
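
To repeat the comparison on a current interpreter, something like the following could be used. It is a sketch that assumes CPython 3, where `pickle.loads` uses the C accelerator when it is available and `pickle._loads` is the pure-Python implementation. `_loads` is a private name and an implementation detail, so it may change between versions. `data` and `time_loads` are hypothetical names, and the byte string only stands in for an array buffer.

```python
import pickle
import timeit

# One large byte string standing in for the raw buffer of an array
# (1,024,000 bytes, the size of 2000 x 128 float32 values).
data = bytes(bytearray(range(256))) * 4000

# Protocol 0 is the text-based protocol discussed in the question.
payload = pickle.dumps(data, protocol=0)


def time_loads(loads, repeat=5):
    """Return the best-of-`repeat` time in seconds for one loads(payload) call."""
    return min(timeit.repeat(lambda: loads(payload), number=1, repeat=repeat))


c_seconds = time_loads(pickle.loads)     # C accelerator when available
py_seconds = time_loads(pickle._loads)   # pure-Python implementation (private name)

print("C loads:      %.4f s" % c_seconds)
print("Python loads: %.4f s" % py_seconds)
```

The numbers depend on the interpreter version and the machine, so they should be measured rather than assumed.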
