Best data type (in terms of speed/RAM) for millions of pairs of a single int paired with a batch (2 to 100) of ints


Problem description

I have about 15 million pairs that each consist of a single int, paired with a batch of (2 to 100) other ints.

If it makes a difference, the ints themselves range from 0 to 15 million.

I have considered using:

Pandas, storing the batches as Python lists

Numpy, where each batch is stored as its own numpy array (since numpy doesn't allow variable-length rows in its 2D data structures)

A Python list of lists.
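For a rough sense of how these options compare on RAM, here is a minimal sketch; the byte counts are typical CPython/NumPy overheads on a 64-bit build, not guarantees:

import sys
import numpy as np

batch_list = list(range(1000, 1050))               # a batch of 50 ints as a Python list
batch_arr = np.arange(1000, 1050, dtype=np.int32)  # the same batch as an int32 array

# a list stores pointers to full int objects (~28 bytes each on CPython),
# while the int32 array stores 4 bytes per element plus a small fixed header
print(sys.getsizeof(batch_list) + sum(sys.getsizeof(x) for x in batch_list))
print(batch_arr.nbytes)  # 200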

I also looked at Tensorflow tfrecords, but I'm not too sure about this one.

I only have about 12 GB of RAM. I will also be using this data to train a machine learning algorithm, so both speed and memory matter.
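A back-of-envelope estimate (assuming an average batch size of about 51 ints, the midpoint of the stated 2 to 100 range) suggests the raw data fits comfortably in 12 GB when stored as int32:

n_pairs = 15_000_000
avg_batch = 51                                # assumed average batch size
total_bytes = n_pairs * (1 + avg_batch) * 4   # 4 bytes per int32
print(f"{total_bytes / 1e9:.1f} GB")          # ~3.1 GB of raw int32 data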

Recommended answer

I would do the following:

# create example data 
A = np.random.randint(0,15000000,100)                                      
B = [np.random.randint(0,15000000,k) for k in np.random.randint(2,101,100)]

int32 is sufficient:

A32 = A.astype(np.int32)
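As a quick sanity check, int32 tops out at 2,147,483,647, far above the stated maximum of 15 million:

assert np.iinfo(np.int32).max >= 15_000_000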

We want to glue all the batches together. First, write down the batch sizes so we can separate them later.

from itertools import chain

sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
boundaries = sizes.cumsum()

# force int32
B_all = np.empty(boundaries[-1],np.int32)
np.concatenate(B,out=B_all)
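To see what sizes and boundaries hold, here is a tiny worked example with two made-up batches of lengths 3 and 2:

toy = [np.array([10, 11, 12], np.int32), np.array([20, 21], np.int32)]
toy_sizes = np.fromiter(chain((0,), map(len, toy)), np.int32, len(toy) + 1)
print(toy_sizes)           # [0 3 2]
print(toy_sizes.cumsum())  # [0 3 5] -> batch i lives at B_all[bounds[i]:bounds[i+1]]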

Split them again after gluing:

B32 = np.split(B_all, boundaries[1:-1])

Finally, make an array of pairs for convenience:

B32_arr = np.empty(len(B32), dtype=object); B32_arr[:] = B32  # ragged batches need an explicit object array on newer NumPy
pairs = np.rec.fromarrays([A32, B32_arr], names=["first", "second"])
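Access then works by field name or by record; the exact values depend on the random example data:

print(pairs[0].first)        # the single int of the first pair
print(pairs[0].second)       # its batch, still a view into B_all
print(pairs['first'].dtype)  # int32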

What was the point of gluing everything together and then splitting it again?

First, note that the re-split arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather, some of its elements) in place, the other one will be updated automatically as well.
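A minimal demonstration of this view relationship:

assert B32[0].base is B_all   # each re-split batch is a view, not a copy
old = B32[0][0]
B_all[0] += 1                 # modify the flat buffer in place...
assert B32[0][0] == old + 1   # ...and the view sees the change
B_all[0] -= 1                 # undo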

The advantage of keeping B_all around is efficiency via numpy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:] (note the sizes[1:] slice, since sizes starts with the 0 we prepended), which is faster than looping through pairs['second'].
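A minimal check that the vectorized version agrees with a plain Python loop over the batches:

means_fast = np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:]
means_slow = np.array([b.mean() for b in pairs['second']])
assert np.allclose(means_fast, means_slow)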

