Numpy histogram of large arrays


Question

I have a bunch of CSV datasets, about 10 GB each. I'd like to generate histograms from their columns. But it seems the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.

Does numpy support online binning? I'm hoping for something that iterates over my CSV line by line and bins values as it reads them, so that at most one line is in memory at any one time.

It wouldn't be hard to roll my own, but I'm wondering if someone has already invented this wheel.

Recommended answer

As you said, it's not that hard to roll your own. You'll need to set up the bin edges yourself and reuse the same edges for every batch as you iterate over the file, adding each batch's counts to a running total. The following ought to be a decent starting point:

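import numpy as np

datamin = -5
datamax = 5
numbins = 20
# numbins bins need numbins + 1 edges
mybins = np.linspace(datamin, datamax, numbins + 1)
# int64 so the running counts cannot overflow on multi-GB inputs
myhist = np.zeros(numbins, dtype='int64')
for i in range(100):
    # stand-in for one batch of values read from the file
    d = np.random.randn(1000)
    # values outside [datamin, datamax] are silently dropped
    htemp, _ = np.histogram(d, bins=mybins)
    myhist += htemp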
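
The snippet above hard-codes datamin and datamax. If the range of the column isn't known in advance, one option is an extra streaming pass over the file to find it before binning. The following is only a sketch: column_range is a hypothetical helper (not part of numpy), and it assumes the file has a header row and that every value in the chosen column parses as a float.

```python
import csv
import io
import math


def column_range(lines, column):
    """Return (min, max) of one CSV column in a single streaming pass."""
    reader = csv.reader(lines)
    next(reader, None)  # assumption: the first row is a header
    lo, hi = math.inf, -math.inf
    for row in reader:
        if not row:  # skip blank lines
            continue
        value = float(row[column])
        lo = min(lo, value)
        hi = max(hi, value)
    return lo, hi


# Small in-memory stand-in; a real run would pass open(path, newline="").
demo = io.StringIO("a,b\n1,2.5\n2,-4.0\n3,7.25\n")
datamin, datamax = column_range(demo, column=1)
```

This costs a second read of the file, but only one row is held in memory at a time.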

I'm guessing performance will be an issue with files this large, and the overhead of calling histogram once per line is likely to be too slow. @doug's suggestion of a generator seems like a good way to address that: feed histogram a batch of rows at a time instead of a single value.
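
As a rough illustration of the generator idea, here is a minimal sketch that reads one column in batches and accumulates the counts. The names column_chunks and streaming_histogram and the chunksize parameter are made up for this example, and it assumes a header row and purely numeric values in the column.

```python
import csv
import io

import numpy as np


def column_chunks(lines, column, chunksize=100_000):
    """Yield one CSV column as float arrays of at most `chunksize` values."""
    reader = csv.reader(lines)
    next(reader, None)  # assumption: the first row is a header
    buf = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        buf.append(float(row[column]))
        if len(buf) == chunksize:
            yield np.asarray(buf)
            buf = []
    if buf:  # final, possibly shorter, batch
        yield np.asarray(buf)


def streaming_histogram(chunks, edges):
    """Sum np.histogram counts over an iterable of arrays, reusing the same edges."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for chunk in chunks:
        htemp, _ = np.histogram(chunk, bins=edges)
        counts += htemp
    return counts


# Small in-memory stand-in; a real run would pass open(path, newline="").
demo = io.StringIO("x,y\n" + "\n".join(f"{i},{i % 4}" for i in range(10)))
edges = np.linspace(0, 4, 5)  # 4 bins with edges 0, 1, 2, 3, 4
counts = streaming_histogram(column_chunks(demo, column=1, chunksize=3), edges)
```

Memory use is bounded by chunksize rather than by file size, so the batch can be made as large as memory allows to cut down the number of histogram calls.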
