Optimizing my large data code with little RAM


Problem description

I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The easiest way to do this would be to simply read in the whole file as a list of arrays and use np.median(arrays, axis=0). However, I don't have much RAM to work with, so this is not a good option.
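
For reference, a rough sketch of that naive whole-file version would look like this; it has to hold all ~50,000 arrays in memory at once, which is exactly what the available RAM rules out.

import pickle
import numpy as np

# Naive approach (illustration only): load every pickled array, then take the median.
# Holding the whole stack of ~50,000 600x600 arrays needs on the order of the
# 120 GB the file occupies on disk, so this is not an option here.
arrays = []
with open('images.dat', 'rb') as f:
    while True:
        try:
            arrays.append(pickle.load(f))
        except EOFError:
            break

stacked_image = np.median(np.array(arrays), axis=0)  # shape (600, 600)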

So, I tried to stack them pixel-by-pixel, as in I focus on one pixel position (i, j) at a time, then read in each array one by one, appending the value at the given position to a list. Once all the values for a certain position across all arrays are saved, I use np.median and then just have to save that value in a list -- which in the end will have the medians of each pixel position. In the end I can just reshape this to 600x600, and I'll be done. The code for this is below.

import pickle
import time
import numpy as np

filename = 'images.dat' #contains my 50,000 2D numpy arrays

def stack_by_pixel(i, j):
    pixels_at_position = []
    with open(filename, 'rb') as f:
        while True:
            try:
                # Gather pixels at a given position
                array = pickle.load(f)
                pixels_at_position.append(array[i][j])
            except EOFError:
                break
    # Stacking at position (median)
    stacked_at_position = np.median(np.array(pixels_at_position))
    return stacked_at_position

# Form whole stacked image
stacked = []
for i in range(600):
    for j in range(600):
        t1 = time.time()
        stacked.append(stack_by_pixel(i, j))
        t2 = time.time()
        print('Done with element %d, %d: %f seconds' % (i, j, (t2-t1)))

stacked_image = np.reshape(stacked, (600,600))

After seeing some of the time printouts, I realize that this is wildly inefficient. Each completion of a position (i, j) takes about 150 seconds or so, which is not surprising since it is reading about 50,000 arrays one by one. And given that there are 360,000 (i, j) positions in my large arrays, this is projected to take 22 months to finish! Obviously this isn't feasible. But I'm sort of at a loss, because there's not enough RAM available to read in the whole file. Or maybe I could save all the pixel positions at once (a separate list for each position) while reading the arrays one by one, but wouldn't keeping 360,000 lists (each about 50,000 elements long) in Python use a lot of RAM as well?

Any suggestions are welcome for how I could make this run significantly faster without using a lot of RAM. Thanks!

Recommended answer

Note: I use Python 2.x, porting this to 3.x shouldn't be difficult.
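
For instance, a minimal illustration of what that port would involve (essentially only the print statements in the scripts below are 2.x-specific):

import time

t1 = time.time()
for i in range(3):
    print(i, end=' ')  # Python 2: print i,
t2 = time.time()
print('\nDone in %0.3f seconds' % (t2 - t1))  # Python 2: print '...' % (t2 - t1)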

My idea is simple - disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.

In order to test this, I wrote a small script that generates a pickle file that resembles yours. I assumed your input images are grayscale with 8-bit depth, and generated 10,000 random images using numpy.random.randint.

This script will act as a benchmark that we can compare the preprocessing and processing stages against.

import numpy as np
import pickle
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

# Dump FILE_COUNT random 8-bit images sequentially into a single pickle file
with open('data/raw_data.pickle', 'wb') as f:
    for i in range(FILE_COUNT):
        data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
        data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
        pickle.dump(data, f)
        print i,

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run this script completed in 372 seconds, generating a ~10 GB file.

Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).

import numpy as np
import pickle
import time

# Increase open file limit
# See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
import win32file
win32file._setmaxstdio(1024)

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

# Open one output file per image row
outfiles = []
for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    outfiles.append(open(outfilename, 'wb'))

# Read the pickled images one at a time and append row j of each image to row file j
with open('data/raw_data.pickle', 'rb') as f:
    for i in range(FILE_COUNT):
        data = pickle.load(f)
        for j in range(IMAGE_HEIGHT):
            data[j].tofile(outfiles[j])
        print i,
        print i,

for i in range(IMAGE_HEIGHT):
    outfiles[i].close()

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run, this script completed in 134 seconds, generating 600 files of 6 million bytes each. It used ~30 MB of RAM.

Processing is simple: load each row file as an array using numpy.fromfile, then use numpy.median to get per-column medians, reducing it to a single row, and accumulate those rows in a list.

Finally, use numpy.vstack to reassemble a median image.

import numpy as np
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

t1 = time.time()

result_rows = []

for i in range(IMAGE_HEIGHT):
    # Load row i of every input image and take the per-column median across images
    outfilename = 'data/row_%03d.dat' % i
    data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    median_row = np.median(data, axis=0)
    result_rows.append(median_row)
    print i,

result = np.vstack(result_rows)
print result

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run, this script completed in 74 seconds. You could even parallelize it quite easily, but it doesn't seem to be worth it. The script used ~40MB of RAM.
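
If you did want to parallelize it, a rough sketch (not part of the measurements above) could hand the independent row files to a multiprocessing pool; file names and dtype are assumed to match the preprocessing script:

import numpy as np
from multiprocessing import Pool

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

def median_of_row_file(i):
    # Same per-row work as above: load row i of every image, take per-column medians
    data = np.fromfile('data/row_%03d.dat' % i, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    return np.median(data, axis=0)

if __name__ == '__main__':
    pool = Pool(processes=4)
    result_rows = pool.map(median_of_row_file, range(IMAGE_HEIGHT))  # keeps row order
    pool.close()
    pool.join()
    result = np.vstack(result_rows)
    print(result.shape)  # (600, 600)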

Since both of those scripts scale linearly with the number of images, the time used should scale linearly as well. For 50,000 images, that works out to about 11 minutes for the preprocessing and 6 minutes for the final processing. This is on an i7-4930K @ 3.4 GHz, deliberately using 32-bit Python.
