Get cumulative count per 2d array
Question
I have some general data, e.g. strings:
import numpy as np
import pandas as pd

np.random.seed(343)
arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print (arr)
[['0' '1' '1' '2' '2' '3' '3' '4' '4' '4']
['1' '2' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '2' '2' '2' '2' '3' '3' '4' '4' '4']
['0' '1' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '1' '1' '1' '2' '2' '2' '2' '4' '4']
['0' '0' '1' '1' '2' '3' '3' '3' '4' '4']
['0' '0' '2' '2' '2' '2' '2' '2' '3' '4']
['0' '0' '1' '1' '1' '2' '2' '2' '3' '3']
['0' '1' '1' '2' '2' '2' '3' '4' '4' '4']
['0' '1' '1' '2' '2' '2' '2' '2' '4' '4']]
For each column I need a running count that resets whenever the value differs from the one in the previous row, so I used pandas.
First create a DataFrame:
df = pd.DataFrame(arr)
print (df)
0 1 2 3 4 5 6 7 8 9
0 0 1 1 2 2 3 3 4 4 4
1 1 2 2 2 3 3 3 4 4 4
2 0 2 2 2 2 3 3 4 4 4
3 0 1 2 2 3 3 3 4 4 4
4 0 1 1 1 2 2 2 2 4 4
5 0 0 1 1 2 3 3 3 4 4
6 0 0 2 2 2 2 2 2 3 4
7 0 0 1 1 1 2 2 2 3 3
8 0 1 1 2 2 2 3 4 4 4
9 0 1 1 2 2 2 2 2 4 4
How it works for one column:
First compare each value with the previous one (the shifted column) and take the cumulative sum, which gives every run of equal values its own group id:
a = (df[0] != df[0].shift()).cumsum()
print (a)
0 1
1 2
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 3
Name: 0, dtype: int32
Then call GroupBy.cumcount:
b = a.groupby(a).cumcount() + 1
print (b)
0 1
1 1
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
dtype: int64
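To make the logic of the two steps above explicit, here is a minimal pure-Python sketch of the same idea for one column: a counter that grows while the value repeats and restarts at 1 when it changes. The name run_count is only illustrative and is not part of pandas or of the original post.

```python
def run_count(values):
    """Return a 1-based counter that restarts whenever the value changes."""
    counts = []
    previous = object()  # sentinel that never compares equal to a real value
    current = 0
    for value in values:
        # Extend the run if the value repeats, otherwise start a new run at 1
        current = current + 1 if value == previous else 1
        counts.append(current)
        previous = value
    return counts


# Column 0 of the example DataFrame
column = ['0', '1', '0', '0', '0', '0', '0', '0', '0', '0']
print(run_count(column))  # [1, 1, 1, 2, 3, 4, 5, 6, 7, 8]
```

This is far slower than pandas on large data; it is only meant to show what the shift/cumsum/cumcount combination computes.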
To apply the solution to all columns, use apply:
print (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
0 1 2 3 4 5 6 7 8 9
0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 2 1 2 2 2 2 2
2 1 2 2 3 1 3 3 3 3 3
3 2 1 3 4 1 4 4 4 4 4
4 3 2 1 1 1 1 1 1 5 5
5 4 1 2 2 2 1 1 1 6 6
6 5 2 1 1 3 1 1 1 1 7
7 6 3 1 1 1 2 2 2 2 1
8 7 1 2 1 1 3 1 1 1 1
9 8 2 3 2 2 4 1 1 2 2
But it is slow, because the data is large. Is it possible to create a fast numpy solution?
I have only found solutions that work for 1d arrays.
Recommended answer
And here is a numba solution. For tricky problems like this one, it always wins, here by a factor of 7 over numpy, since only a single pass over res is made.
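import numpy as np
from numba import njit

@njit
def thefunc(arrc):
    # arrc[i-1, j] is True when row i equals row i-1 in column j
    m, n = arrc.shape
    res = np.empty((m + 1, n), np.uint32)
    res[0] = 1  # the first row always starts a new run
    for i in range(1, m + 1):
        for j in range(n):
            if arrc[i - 1, j]:
                res[i, j] = res[i - 1, j] + 1  # same value as the row above: extend the run
            else:
                res[i, j] = 1  # value changed: reset the counter
    return res

def numbering(arr):
    # The string comparison is done outside the jitted function
    return thefunc(arr[1:] == arr[:-1])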
I need to externalize arr[1:]==arr[:-1] since numba doesn't support strings.
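For readers who cannot install numba, the same reset-on-change numbering can be sketched with numpy alone. This is a hedged illustration, not the grp_range_2dcol function timed below and not part of the original answer; the name numbering_numpy is hypothetical. It marks where each run starts and uses a running maximum to find the row index of the most recent start.

```python
import numpy as np


def numbering_numpy(arr):
    """Per-column 1-based counter that resets when the value changes."""
    m, n = arr.shape
    # start[i, j] is True where a new run begins in column j
    start = np.ones((m, n), dtype=bool)
    start[1:] = arr[1:] != arr[:-1]
    idx = np.arange(m)[:, None]  # row indices as a column vector
    # Row index of the most recent run start, carried down each column
    last_start = np.maximum.accumulate(np.where(start, idx, 0), axis=0)
    return idx - last_start + 1


small = np.array([['0', '1'],
                  ['1', '1'],
                  ['0', '2'],
                  ['0', '2']])
print(numbering_numpy(small))
# [[1 1]
#  [1 2]
#  [1 1]
#  [2 2]]
```

Unlike the numba loop, this builds several temporary arrays of the full size, so it should not be expected to match the numba timings.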
In [75]: %timeit numbering(arr)
13.7 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [76]: %timeit grp_range_2dcol(arr)
111 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a bigger array (100 000 rows x 100 columns), the gap is not as wide:
In [168]: %timeit a=grp_range_2dcol(arr)
1.54 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [169]: %timeit a=numbering(arr)
625 ms ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If arr can be converted to 'S8', we can save a lot of time:
In [398]: %timeit arr[1:]==arr[:-1]
584 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [399]: %timeit arr.view(np.uint64)[1:]==arr.view(np.uint64)[:-1]
196 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
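The trick in the timing above can be sketched as follows: an 'S8' array stores every element as exactly 8 bytes, so it can be reinterpreted as 64-bit integers and compared as integers instead of as strings. This is a small illustration under the assumption that every string fits in 8 bytes (longer strings would be truncated by the conversion); the function name equal_to_previous_row is hypothetical.

```python
import numpy as np


def equal_to_previous_row(arr):
    """Compare each row with the previous one via a uint64 view of 'S8' data."""
    arr_s8 = arr.astype('S8')         # fixed-width 8-byte strings, padded with null bytes
    as_int = arr_s8.view(np.uint64)   # same memory, read as one 64-bit integer per element
    return as_int[1:] == as_int[:-1]


rng = np.random.default_rng(0)
arr = rng.integers(5, size=(6, 4)).astype(str)

# The integer comparison gives the same boolean mask as the string comparison
assert np.array_equal(equal_to_previous_row(arr), arr[1:] == arr[:-1])
```

The resulting mask can be passed to thefunc in place of arr[1:]==arr[:-1]. In the timing above, arr is already stored as 'S8', so the conversion cost is not included there.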