Get cumulative count per 2d array

Views: 78  Published: 2020/5/18 20:14:23  Tags: python arrays numpy counter cumulative-sum


Question

I have general data, e.g. strings:

import numpy as np
import pandas as pd

np.random.seed(343)

arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print (arr)
[['0' '1' '1' '2' '2' '3' '3' '4' '4' '4']
 ['1' '2' '2' '2' '3' '3' '3' '4' '4' '4']
 ['0' '2' '2' '2' '2' '3' '3' '4' '4' '4']
 ['0' '1' '2' '2' '3' '3' '3' '4' '4' '4']
 ['0' '1' '1' '1' '2' '2' '2' '2' '4' '4']
 ['0' '0' '1' '1' '2' '3' '3' '3' '4' '4']
 ['0' '0' '2' '2' '2' '2' '2' '2' '3' '4']
 ['0' '0' '1' '1' '1' '2' '2' '2' '3' '3']
 ['0' '1' '1' '2' '2' '2' '3' '4' '4' '4']
 ['0' '1' '1' '2' '2' '2' '2' '2' '4' '4']]

I need a cumulative count down each column that resets whenever the value changes, so I used pandas.

First, create a DataFrame:

df = pd.DataFrame(arr)
print (df)
   0  1  2  3  4  5  6  7  8  9
0  0  1  1  2  2  3  3  4  4  4
1  1  2  2  2  3  3  3  4  4  4
2  0  2  2  2  2  3  3  4  4  4
3  0  1  2  2  3  3  3  4  4  4
4  0  1  1  1  2  2  2  2  4  4
5  0  0  1  1  2  3  3  3  4  4
6  0  0  2  2  2  2  2  2  3  4
7  0  0  1  1  1  2  2  2  3  3
8  0  1  1  2  2  2  3  4  4  4
9  0  1  1  2  2  2  2  2  4  4

How it works for one column:

First, compare each value with the shifted data and take the cumulative sum of the result:

a = (df[0] != df[0].shift()).cumsum()
print (a)
0    1
1    2
2    3
3    3
4    3
5    3
6    3
7    3
8    3
9    3
Name: 0, dtype: int32

Then call GroupBy.cumcount:

b = a.groupby(a).cumcount() + 1
print (b)
0    1
1    1
2    1
3    2
4    3
5    4
6    5
7    6
8    7
9    8
dtype: int64
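
The two steps above can be checked by hand on a smaller input. The sketch below uses a hypothetical toy Series (not part of the original data): the shifted comparison marks the start of each run, the cumulative sum turns those marks into a run id, and cumcount numbers the rows inside each run.

```python
import pandas as pd

# Hypothetical toy column, chosen so the result can be verified by hand.
s = pd.Series(["a", "a", "b", "b", "b", "a"])

# Step 1: True wherever the value differs from the previous row;
# the cumulative sum turns those flags into one id per run.
run_id = (s != s.shift()).cumsum()

# Step 2: number the rows inside each run, starting at 1.
count = s.groupby(run_id).cumcount() + 1

print(run_id.tolist())  # [1, 1, 2, 2, 2, 3]
print(count.tolist())   # [1, 2, 1, 2, 3, 1]
```

The first row always starts a run, because comparing it with the missing shifted value yields True.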

To apply the solution to all columns, it is possible to use apply:

print (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
   0  1  2  3  4  5  6  7  8  9
0  1  1  1  1  1  1  1  1  1  1
1  1  1  1  2  1  2  2  2  2  2
2  1  2  2  3  1  3  3  3  3  3
3  2  1  3  4  1  4  4  4  4  4
4  3  2  1  1  1  1  1  1  5  5
5  4  1  2  2  2  1  1  1  6  6
6  5  2  1  1  3  1  1  1  1  7
7  6  3  1  1  1  2  2  2  2  1
8  7  1  2  1  1  3  1  1  1  1
9  8  2  3  2  2  4  1  1  2  2

But it is slow, because the data is large. Is it possible to create a fast numpy solution?

I have only found solutions that work for a 1d array.

Recommended answer

And here is a numba solution. For tricky problems like this it always wins, here by about a 7x factor over numpy (grp_range_2dcol in the timings below), since only one pass over res is made.

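import numpy as np
from numba import njit

@njit
def thefunc(arrc):
    # arrc[i, j] is True when row i+1 equals row i in column j
    m, n = arrc.shape
    res = np.empty((m + 1, n), np.uint32)
    res[0] = 1  # the first row always starts a new count
    for i in range(1, m + 1):
        for j in range(n):
            if arrc[i - 1, j]:
                res[i, j] = res[i - 1, j] + 1  # same value as the row above: keep counting
            else:
                res[i, j] = 1  # value changed: reset the counter
    return res

def numbering(arr):
    # compare consecutive rows outside numba, then count in the jitted loop
    return thefunc(arr[1:] == arr[:-1])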

The comparison arr[1:]==arr[:-1] has to be done outside the jitted function, since numba doesn't support strings.
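
For readers without numba, the same reset-on-change count can also be sketched in plain NumPy. This is an illustrative alternative, not the answer's method and not the grp_range_2dcol function timed below; the name numbering_np and the toy input are assumptions. It records the row index where each run starts and propagates it down the column with np.maximum.accumulate.

```python
import numpy as np

def numbering_np(arr):
    """Per-column counter that restarts at 1 whenever the value changes."""
    m, n = arr.shape
    # True where a row differs from the row above; the first row always starts a run.
    start = np.ones((m, n), dtype=bool)
    start[1:] = arr[1:] != arr[:-1]
    # Row index of each cell, broadcast across the columns.
    rows = np.arange(m)[:, None]
    # Row index of the most recent run start at or above each cell.
    last_start = np.maximum.accumulate(np.where(start, rows, 0), axis=0)
    return rows - last_start + 1

# Hypothetical toy input, small enough to verify by hand.
toy = np.array([["0", "1"],
                ["1", "1"],
                ["1", "2"],
                ["1", "2"],
                ["0", "2"]])
result = numbering_np(toy)
print(result)
```

Unlike the jitted loop, this version makes several passes over the data and allocates temporaries, so it should not be expected to match the numba timings.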

In [75]: %timeit numbering(arr)
13.7 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [76]: %timeit grp_range_2dcol(arr)
111 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For a bigger array (100 000 rows x 100 cols), the gap is not so wide:

In [168]: %timeit a=grp_range_2dcol(arr)
1.54 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [169]: %timeit a=numbering(arr)
625 ms ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If arr can be converted to 'S8', we can save a lot of time:

In [398]: %timeit arr[1:]==arr[:-1]
584 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [399]: %timeit arr.view(np.uint64)[1:]==arr.view(np.uint64)[:-1]
196 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
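
A minimal sketch of the 'S8' idea, assuming every string fits in 8 bytes once encoded (longer values would be truncated by the conversion). The names arr_u, arr_s8 and as_int are illustrative. Each 8-byte cell is reinterpreted as one 64-bit integer, so the row comparison becomes an integer comparison instead of a string comparison.

```python
import numpy as np

# Hypothetical small array of short strings.
arr_u = np.array([["0", "1"], ["0", "2"], ["1", "2"]])

# Fixed-width 8-byte strings, padded with null bytes.
arr_s8 = arr_u.astype("S8")

# Reinterpret each 8-byte cell as a single uint64; the view does not copy data.
as_int = arr_s8.view(np.uint64)

mask_str = arr_u[1:] == arr_u[:-1]    # string comparison
mask_int = as_int[1:] == as_int[:-1]  # integer comparison on the same bytes
print(mask_int)
```

Both masks agree here, so mask_int could be passed to the counting function in place of the string comparison.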
