Fastest way to extract dictionary of sums in numpy in 1 I/O pass

Views: 117 · Published: 2017/5/21 21:32:56 · Tags: python numpy pandas dictionary vectorization


Question

Let's say I have an array like:

arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])

and I would like to form a dictionary that maps each value in the first column to the sum of the third column over the rows with that value, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique elements in the first column.

Approach 1: Naively, I can invoke np.unique() then iterate through each item in the unique array with a combination of np.where() and np.sum() in a one-liner dictionary enclosing a list comprehension. This would be reasonably fast if I have a small number of unique elements, but at 100k unique elements, I will incur a lot of wasted page fetches making 100k I/O passes of the entire array.
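For concreteness, Approach 1 might look roughly like the sketch below. The exact one-liner is not given in the question, so the dict comprehension and the name `naive` are assumptions; the point is only that every unique key triggers its own full scan of the array.

```python
import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# One full scan of the first column per unique key: cheap for a handful
# of keys, but 100k unique keys means 100k passes over all rows.
naive = {int(k): int(arr[np.where(arr[:, 0] == k)[0], 2].sum())
         for k in np.unique(arr[:, 0])}

print(naive)  # {1: 13, 2: 6, 3: 9}
```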

Approach 2: Alternatively, I could make a single I/O pass over the array, hashing column 1 at each row and accumulating the last column (the hashing will probably be cheaper than the excessive page fetches), but I lose the advantage of numpy's C inner loop vectorization here.
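A pure-Python version of Approach 2 could be sketched as follows. This is only an illustration of the single-pass idea the question wants to avoid writing by hand; the names `sums` and `result` are hypothetical, and `arr.tolist()` is used purely because the example array is tiny.

```python
from collections import defaultdict

import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# Single pass: hash the key in column 0 and accumulate column 2.
# Each row is touched once, but the loop runs in the Python interpreter.
sums = defaultdict(int)
for key, _, value in arr.tolist():
    sums[key] += value

result = dict(sums)
print(result)  # {1: 13, 3: 9, 2: 6}
```

The keys come out in first-seen order rather than sorted order, since a dict preserves insertion order.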

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?

Solution

numpy approach:

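import numpy as np

u = np.unique(arr[:, 0])  # unique keys in the first column
# N x len(u) mask of key matches, weighted by the third column, summed per key
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)

dict(np.stack([u, s]).T)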

{1: 13, 2: 6, 3: 9}
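Note that the broadcasted comparison above builds an intermediate of shape (rows, unique keys), which is unlikely to fit in memory at 1 billion rows by 100k keys. A lighter alternative, which is not part of the original answer, is to relabel the keys with `np.unique(..., return_inverse=True)` and sum with `np.bincount`. The sketch below assumes the sums stay below 2**53, because `np.bincount` accumulates weights as float64; the name `grouped` is hypothetical.

```python
import numpy as np

arr = np.array([[1, 20, 5],
                [1, 20, 8],
                [3, 10, 4],
                [2, 30, 6],
                [3, 10, 5]])

# Map each key to a dense label 0..len(keys)-1 (np.unique sorts internally).
keys, inverse = np.unique(arr[:, 0], return_inverse=True)

# One vectorized pass: add each row's third column into its key's bucket.
sums = np.bincount(inverse, weights=arr[:, 2])  # float64 result

grouped = dict(zip(keys.tolist(), sums.astype(arr.dtype).tolist()))
print(grouped)  # {1: 13, 2: 6, 3: 9}
```

If the keys are already small non-negative integers, the `np.unique` step could probably be skipped by passing column 0 to `np.bincount` directly, at the cost of zero-valued entries for keys that never occur.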

pandas approach:

import pandas as pd
import numpy as np

pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}
