Fastest way to extract dictionary of sums in numpy in 1 I/O pass


Problem description

Let's say I have an array like:

arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])

and I would like to form a dictionary that maps each distinct value in the first column to the sum of the third-column values over all rows that share it, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique elements in the first column.

Approach 1: Naively, I can invoke np.unique() then iterate through each item in the unique array with a combination of np.where() and np.sum() in a one-liner dictionary enclosing a list comprehension. This would be reasonably fast if I have a small number of unique elements, but at 100k unique elements, I will incur a lot of wasted page fetches making 100k I/O passes of the entire array.
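For concreteness, Approach 1 might look roughly like the sketch below. The variable names (`keys`, `sums`) are illustrative and not taken from the question; the point is that every key triggers its own full scan of the first column.

```python
import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# One pass to find the distinct keys in column 0.
keys = np.unique(arr[:, 0])

# One additional full scan of column 0 per key: np.where() locates the
# matching rows, then column 2 of those rows is summed.
sums = {int(k): int(arr[np.where(arr[:, 0] == k)[0], 2].sum()) for k in keys}

print(sums)
```

With 100k distinct keys, the comprehension body runs 100k times over the whole array, which is the cost the question is trying to avoid.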

Approach 2: Alternatively, I could make a single I/O pass over the array, adding each last-column value to a running total keyed by column 1 (hashing column 1 at each row will probably be cheaper than the excessive page fetches), but I lose the advantage of numpy's C inner loop vectorization here.
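A pure-Python version of Approach 2 could be sketched as follows. This is only an illustration of the single-pass idea, using `collections.defaultdict` as the accumulator (an assumption; the question does not specify a data structure). It reads each row once but does the work in the interpreter rather than in numpy's C loops.

```python
from collections import defaultdict

import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

totals = defaultdict(int)

# Single pass: each row's key is hashed once and its value is added
# to the running total for that key.
for key, value in zip(arr[:, 0].tolist(), arr[:, 2].tolist()):
    totals[key] += value

result = dict(totals)
print(result)
```

Calling `.tolist()` on the full columns copies them into Python lists; for an array with a billion rows this would have to be done in slices rather than all at once.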

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?

Solution

numpy approach:

u = np.unique(arr[:, 0])
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)

dict(np.stack([u, s]).T)

{1: 13, 2: 6, 3: 9}
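Note that the broadcast comparison above builds an intermediate array with one row per input row and one column per unique key, which may not fit in memory at the scale described in the question. A possible alternative, not part of the original answer, is to combine `np.unique(..., return_inverse=True)` with `np.bincount(..., weights=...)`. The sketch below assumes the sums fit in float64 without loss, since `np.bincount` returns floats when weights are given.

```python
import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# u holds the sorted distinct keys; inv gives, for every row,
# the position of that row's key inside u.
u, inv = np.unique(arr[:, 0], return_inverse=True)

# Accumulate column 2 into one bucket per key in a single vectorized call.
# With weights supplied, the result is a float64 array.
sums = np.bincount(inv.ravel(), weights=arr[:, 2], minlength=len(u))

result = {int(k): int(v) for k, v in zip(u, sums)}
print(result)
```

`np.unique` sorts the keys internally, so this is not strictly a single pass, but the intermediate arrays stay the same length as the input. If the keys are already small non-negative integers, `np.bincount` can be applied to the first column directly and the `np.unique` step skipped.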

pandas approach:

import pandas as pd
import numpy as np

pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}
