Concatenate range arrays given start, stop numbers in a vectorized way - NumPy
Question
I have two matrices of interest. The first is a "bag of words" matrix with two columns: the document ID and the term ID. For example:
bow[0:10]
Out[1]:
array([[ 0, 10],
       [ 0, 12],
       [ 0, 19],
       [ 0, 20],
       [ 1,  9],
       [ 1, 24],
       [ 2, 33],
       [ 2, 34],
       [ 2, 35],
       [ 3,  2]])
In addition, I have an "index" matrix in which row i holds the start (inclusive) and stop (exclusive) row positions of document ID i in the bag of words matrix. For example, row 0 gives the range of rows for doc ID 0:
index[0:4]
Out[2]:
array([[ 0,  4],
       [ 4,  6],
       [ 6,  9],
       [ 9, 10]])
What I'd like to do is take a random sample of document IDs and get all of the bag-of-words rows for those document IDs. The bag of words matrix has roughly 150M rows (~1.5 GB), so using numpy.in1d() is too slow. We need to return these rows quickly to feed a downstream task.
The naive solution I have come up with is as follows:
import numpy as np

def get_rows(ids):
    # index[ids] picks the (start, stop) pair of each sampled document
    indices = np.concatenate([np.arange(x1, x2) for x1, x2 in index[ids]])
    return bow[indices]

get_rows([4, 10, 3, 5])
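To try the naive approach on the excerpts shown above, the following minimal sketch rebuilds the ten `bow` rows and the four `index` rows as small arrays. Only those excerpts are used, so the sampled IDs here are chosen to fit them and are not the IDs from the call above.

```python
import numpy as np

# The ten bag-of-words rows shown in the question: (doc ID, term ID).
bow = np.array([[0, 10], [0, 12], [0, 19], [0, 20],
                [1, 9], [1, 24],
                [2, 33], [2, 34], [2, 35],
                [3, 2]])

# Start (inclusive) and stop (exclusive) row of each document in bow.
index = np.array([[0, 4], [4, 6], [6, 9], [9, 10]])

def get_rows(ids):
    # One np.arange per sampled document, concatenated after a Python-level loop.
    indices = np.concatenate([np.arange(x1, x2) for x1, x2 in index[ids]])
    return bow[indices]

rows = get_rows([3, 1])
print(rows)  # rows of doc 3 followed by rows of doc 1
```

The list comprehension is the part that does not scale: it creates one small array per sampled document before concatenating them.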
Generic example
A generic example of the problem would be something like this:
indices = np.array([[ 4,  7],
                    [10, 16],
                    [11, 18]])
The expected output is:
array([ 4, 5, 6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])
Recommended answer
I think I have finally cracked it with a cumsum trick for a vectorized solution:
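def create_ranges(a):
    lens = a[:,1] - a[:,0]                  # length of each range
    clens = lens.cumsum()                   # output position where each range ends
    ids = np.ones(clens[-1], dtype=int)     # default step of +1 everywhere
    ids[0] = a[0,0]                         # the output starts at the first start value
    # At each range boundary, step from the previous range's last value to the next start
    ids[clens[:-1]] = a[1:,0] - a[:-1,1] + 1
    out = ids.cumsum()
    return out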
Sample run:
In [416]: a = np.array([[4,7],[10,16],[11,18]])
In [417]: create_ranges(a)
Out[417]: array([ 4, 5, 6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])
In [425]: a = np.array([[-2,4],[-5,2],[11,12]])
In [426]: create_ranges(a)
Out[426]: array([-2, -1, 0, 1, 2, 3, -5, -4, -3, -2, -1, 0, 1, 11])
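To see why the cumsum trick works, it can help to look at the intermediate arrays for the first sample. The sketch below repeats the same steps as create_ranges outside the function; the variable names `lengths` and `steps` are illustrative only.

```python
import numpy as np

a = np.array([[4, 7], [10, 16], [11, 18]])

lengths = a[:, 1] - a[:, 0]      # [3, 6, 7]: size of each range
clens = lengths.cumsum()         # [3, 9, 16]: output position where each range ends

# Start with a step of +1 everywhere, so a cumulative sum counts upward.
steps = np.ones(clens[-1], dtype=int)
steps[0] = a[0, 0]               # the output begins at the first start, 4

# At the first slot of every later range, replace the +1 by the jump from the
# previous range's last value (stop - 1) to the new start: start - stop + 1.
steps[clens[:-1]] = a[1:, 0] - a[:-1, 1] + 1   # 10 - 7 + 1 = 4, 11 - 16 + 1 = -4

out = steps.cumsum()
print(steps)
print(out)
```

Only positions 0, 3, and 9 of `steps` differ from 1, so a single cumulative sum reproduces all three ranges back to back.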
If the starts and stops are given as two 1D arrays, we just need to use those in place of the first and second columns. For completeness, here's the full code:
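def create_ranges(starts, ends):
    lens = ends - starts                    # length of each range
    clens = lens.cumsum()                   # output position where each range ends
    ids = np.ones(clens[-1], dtype=int)     # default step of +1 everywhere
    ids[0] = starts[0]                      # the output starts at the first start value
    # At each range boundary, step from the previous range's last value to the next start
    ids[clens[:-1]] = starts[1:] - ends[:-1] + 1
    out = ids.cumsum()
    return out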
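One caveat, as far as I can tell: the cumsum version assumes every range holds at least one element. With a zero-length range (start equal to stop), two boundary writes land on the same position and the result is off. The sketch below is an alternative formulation built on np.repeat rather than the answer's method. The names `create_ranges_safe` and `get_rows_vectorized` are hypothetical, and it assumes integer inputs with every stop greater than or equal to its start.

```python
import numpy as np

def create_ranges_safe(starts, stops):
    """Concatenate arange(start, stop) for each pair, allowing empty ranges.

    Hypothetical helper; assumes stops >= starts elementwise.
    """
    starts = np.asarray(starts, dtype=np.int64)
    stops = np.asarray(stops, dtype=np.int64)
    lengths = stops - starts
    # Output position at which each range begins.
    offsets = lengths.cumsum() - lengths
    # Each output slot equals its own position plus (start - offset) of its range.
    return np.arange(lengths.sum()) + np.repeat(starts - offsets, lengths)

def get_rows_vectorized(bow, index, ids):
    # Same job as the question's get_rows, without the Python-level loop.
    sel = index[ids]
    return bow[create_ranges_safe(sel[:, 0], sel[:, 1])]

r1 = create_ranges_safe([4, 10, 11], [7, 16, 18])
r2 = create_ranges_safe([4, 10, 11], [7, 10, 13])   # middle range is empty
print(r1)
print(r2)

# Sample data from the question.
bow = np.array([[0, 10], [0, 12], [0, 19], [0, 20], [1, 9],
                [1, 24], [2, 33], [2, 34], [2, 35], [3, 2]])
index = np.array([[0, 4], [4, 6], [6, 9], [9, 10]])
rows = get_rows_vectorized(bow, index, [3, 1])
print(rows)
```

np.repeat simply skips ranges whose length is zero, so no special-casing is needed. Whether this variant is faster than the cumsum version on 150M rows would need to be measured on the real data.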