Fastest way to extract dictionary of sums in numpy in 1 I/O pass
Problem description
Let's say I have an array like:
arr = np.array([[1,20,5],
[1,20,8],
[3,10,4],
[2,30,6],
[3,10,5]])
and I would like to form a dictionary of the sum of the third column for each row that matches each value in the first column, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there's 1 billion rows in my array and 100k unique elements in the first column.
Approach 1: Naively, I can invoke np.unique() then iterate through each item in the unique array with a combination of np.where() and np.sum() in a one-liner dictionary enclosing a list comprehension. This would be reasonably fast if I have a small number of unique elements, but at 100k unique elements, I will incur a lot of wasted page fetches making 100k I/O passes of the entire array.
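For concreteness, here is one possible reading of Approach 1 on the sample array. It is only a sketch: the name sums_naive is made up for illustration, and a dict comprehension stands in for the "dictionary enclosing a list comprehension" described above. Each key triggers its own full scan of column 0, which is where the repeated passes come from.

```python
import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# One full scan of column 0 per unique key, so the total work grows as
# n_rows * n_unique. sums_naive is a hypothetical name for this sketch.
sums_naive = {
    int(k): int(arr[np.where(arr[:, 0] == k)[0], 2].sum())
    for k in np.unique(arr[:, 0])
}
print(sums_naive)  # {1: 13, 2: 6, 3: 9}
```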
Approach 2: I could make a single I/O pass of the last column (because having to hash column 1 at each row will probably be cheaper than the excessive page fetches) too, but I lose the advantage of numpy's C inner loop vectorization here.
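The pure Python version of Approach 2 might look like the sketch below. The name sums_loop is an assumption made for illustration. It touches each row exactly once, but every addition goes through the interpreter, which is the cost described above.

```python
from collections import defaultdict

import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# Single pass over the rows: hash the key from column 0 and accumulate
# the value from column 2. sums_loop is a hypothetical name.
sums_loop = defaultdict(int)
for key, value in zip(arr[:, 0], arr[:, 2]):
    sums_loop[int(key)] += int(value)

print(dict(sums_loop))
```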
Is there a fast way to implement Approach 2 without resorting to a pure Python loop?
numpy approach:
u = np.unique(arr[:, 0])
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)
dict(np.stack([u, s]).T)
{1: 13, 2: 6, 3: 9}
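Note that the broadcasted comparison above builds an intermediate of shape (n_rows, n_unique), which is unlikely to fit in memory at 1 billion rows and 100k keys. A possible alternative, not part of the original answer, is to map each key to a dense index with np.unique(..., return_inverse=True) and let np.bincount accumulate the sums in C. This is a sketch under two assumptions: np.unique sorts its input, so it is not strictly a single pass, and bincount returns float64 sums when given weights, so integer totals are only exact up to 2**53.

```python
import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# keys: sorted unique values of column 0.
# inverse: for each row, the position of its key within keys.
keys, inverse = np.unique(arr[:, 0], return_inverse=True)

# Sum column 2 into one bucket per key; the result is float64.
totals = np.bincount(inverse, weights=arr[:, 2])

sums = {int(k): int(t) for k, t in zip(keys, totals)}
print(sums)  # {1: 13, 2: 6, 3: 9}
```

If the keys are already small non-negative integers, passing arr[:, 0] directly to np.bincount would skip the sort, at the cost of an output of length max(key) + 1.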
pandas approach:
import pandas as pd
import numpy as np
pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()
{1: 13, 2: 6, 3: 9}