如何使用numpy在线性时间内按唯一值获取累计计数? [英] How to use numpy to get the cumulative count by unique values in linear time?
问题描述
请考虑以下列表short_list
和long_list
short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
long_list = pd.DataFrame(
np.random.choice(list(ascii_letters),
(10000, 2))
).sum(1).tolist()
如何通过唯一值计算累计计数?
How do I calculate the cumulative count by unique value?
我想使用numpy并在线性时间内进行.我希望这可以将计时与其他方法进行比较.用我提出的第一个解决方案可能最容易说明
I want to use numpy and do it in linear time. I want this to compare timings with my other methods. It may be easiest to illustrate with my first proposed solution
def pir1(l):
s = pd.Series(l)
return s.groupby(s).cumcount().tolist()
print(np.array(short_list))
print(pir1(short_list))
['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]
我尝试使用np.unique
折磨了自己,因为它返回了一个counts数组,一个反向数组和一个index数组.我确定我可以解决这些问题.我得到的最好的结果是在pir4
下,该值以二次时间缩放.另外请注意,我不在乎计数是从1还是从0开始,因为我们可以简单地加或减1.
I've tortured myself trying to use np.unique
because it returns a counts array, an inverse array, and an index array. I was sure I could these to get at a solution. The best I got is in pir4
below which scales in quadratic time. Also note that I don't care if counts start at 1 or zero as we can simply add or subtract 1.
下面是我的一些尝试(没有一个能回答我的问题)
Below are some of my attempts (none of which answer my question)
%%cython
from collections import defaultdict
def get_generator(l):
counter = defaultdict(lambda: -1)
for i in l:
counter[i] += 1
yield counter[i]
def pir2(l):
return [i for i in get_generator(l)]
def pir3(l):
return [i for i in get_generator(l)]
def pir4(l):
unq, inv = np.unique(l, 0, 1, 0)
a = np.arange(len(unq))
matches = a[:, None] == inv
return (matches * matches.cumsum(1)).sum(0).tolist()
推荐答案
这是一种使用自定义分组范围创建函数和np.unique
获取计数的矢量化方法-
Here's a vectorized approach using custom grouped range creating function and np.unique
for getting the counts -
def grp_range(a):
idx = a.cumsum()
id_arr = np.ones(idx[-1],dtype=int)
id_arr[0] = 0
id_arr[idx[:-1]] = -a[:-1]+1
return id_arr.cumsum()
count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]
样品运行-
In [117]: A = list('aaabaaacaaadaaac')
In [118]: count = np.unique(A,return_counts=1)[1]
...: out = grp_range(count)[np.argsort(A).argsort()]
...:
In [119]: out
Out[119]: array([ 0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1])
要获得count
,可以提出一些其他针对性能的替代方案-
For getting the count
, few other alternatives could be proposed with focus on performance -
np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)
另外,如果A
包含single-letter
个字符,我们可以简单地使用-
Additionally, with A
containing single-letter
characters, we could get the count simply with -
np.bincount(np.array(A).view('uint8')-97)
这篇关于如何使用numpy在线性时间内按唯一值获取累计计数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!