How to build (or precompute) a histogram from a file too large for memory?

Problem description

Is there a graphing library for Python that doesn't require storing all raw data points as a numpy array or list in order to graph a histogram?

I have a dataset too large for memory, and I don't want to use subsampling to reduce the data size.

What I'm looking for is a library that can take the output of a generator (each data point yielded from a file, as a float) and build a histogram on the fly.

This includes computing bin size as the generator yields each data point from the file.

If such a library doesn't exist, I'd like to know whether numpy is able to precompute a counter of {bin_1: count_1, bin_2: count_2 ... bin_x: count_x} from yielded data points.

Data points are held as a vertical matrix in a tab-delimited file, arranged node-node-score like below:

node   node   5.55555
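As a point of reference (an addition, not part of the original question), here is a minimal sketch of the kind of generator described above, assuming the tab-delimited node-node-score layout; the helper name iter_scores is made up for illustration, and the file name matches the one used in the attempted answer below:

def iter_scores(path='gsl_test_1.txt'):
    """Yield the third (score) column of a tab-delimited node-node-score file as floats."""
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip('\n').split('\t')
            yield float(fields[2])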

More information:

  • 104301133 rows of data (so far)
  • I don't know the minimum or maximum values
  • The bin widths should all be the same
  • The number of bins might be 1000
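A hedged sketch (an addition, not from the original post) of the precomputed-counter idea asked about above: batches drawn from a generator of floats are passed to np.histogram and the per-bin counts are accumulated, so the raw data is never held in full. It assumes the bin edges have already been fixed, e.g. after a first pass over the file for the minimum and maximum as in the attempted answer below; streaming_bin_counts and buffer_size are illustrative names only:

import numpy as np

def streaming_bin_counts(scores, bin_edges, buffer_size=100000):
    """Accumulate histogram counts from an iterable of floats without storing them all."""
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    buffer = []
    for x in scores:
        buffer.append(x)
        if len(buffer) >= buffer_size:  # flush periodically to keep memory bounded
            counts += np.histogram(buffer, bins=bin_edges)[0]
            buffer = []
    if buffer:  # flush the final partial batch
        counts += np.histogram(buffer, bins=bin_edges)[0]
    return counts

# For example, with 1000 equal-width bins once low/high are known:
# bin_edges = np.linspace(low, high, 1000 + 1)
# counter = dict(zip(bin_edges[:-1], streaming_bin_counts(iter_scores(), bin_edges)))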

Attempted answer:

import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

low = np.inf
high = -np.inf

# find the overall min/max
chunksize = 1000
lines = 0
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
    lines += len(chunk)  # count actual rows; loop*chunksize would overcount a partial final chunk

nbins = math.ceil(math.sqrt(lines))   

bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)  # bin-count accumulator; int64 (rather than uint32) avoids overflow


# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64

    # accumulate bin counts over chunks
    total += subtotal


plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
# plt.bar(np.arange(total.shape[0]), total, width=1)
plt.savefig('gsl_test_hist.svg')

Output:

Recommended answer

You could iterate over chunks of your dataset and use np.histogram to accumulate your bin counts into a single vector (you would need to define your bin edges a priori and pass them to np.histogram using the bins= parameter), e.g.:

import numpy as np
import pandas as pd

bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.uint)

# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)

    # accumulate bin counts over chunks
    total += subtotal.astype(np.uint)

If you want to ensure that your bins span the full range of values in your array, but you don't already know the minimum and maximum then you will need to loop over it once beforehand to compute these (e.g. using np.min/np.max), for example:

low = np.inf
high = -np.inf

# find the overall min/max
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)

Once you have your array of bin counts, you can then generate a bar plot directly using plt.bar:

plt.bar(bin_edges[:-1], total, width=1)
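A small caveat (an addition, not from the original answer): width=1 draws bars one data-unit wide, so if the bin edges are not spaced exactly one unit apart the bars will not line up with the bins. Passing the actual bin widths is one way to handle that, e.g.:

plt.bar(bin_edges[:-1], total, width=np.diff(bin_edges), align='edge')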

It's also possible to use the weights= parameter to plt.hist in order to generate a histogram from a vector of counts rather than samples, e.g.:

plt.hist(bin_edges[:-1], bins=bin_edges, weights=total, ...)
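(This works because each left bin edge is passed as a single sample that lands in its own bin, and the weights= argument replaces that sample's unit contribution with the precomputed count for that bin.)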
