如何使用numpy在线性时间内按唯一值获取累计计数? [英] How to use numpy to get the cumulative count by unique values in linear time?

查看:61
本文介绍了如何使用numpy在线性时间内按唯一值获取累计计数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

请考虑以下列表short_listlong_list

short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters),
                     (10000, 2))
).sum(1).tolist()

如何通过唯一值计算累计计数?

How do I calculate the cumulative count by unique value?

我想使用numpy并在线性时间内进行.我希望这可以将计时与其他方法进行比较.用我提出的第一个解决方案可能最容易说明

I want to use numpy and do it in linear time. I want this to compare timings with my other methods. It may be easiest to illustrate with my first proposed solution

def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()

print(np.array(short_list))
print(pir1(short_list))

['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]

我尝试使用np.unique折磨了自己,因为它返回了一个counts数组,一个反向数组和一个index数组.我确定我可以解决这些问题.我得到的最好的结果是在pir4下,该值以二次时间缩放.另外请注意,我不在乎计数是从1还是从0开始,因为我们可以简单地加或减1.

I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could these to get at a solution. The best I got is in pir4 below which scales in quadratic time. Also note that I don't care if counts start at 1 or zero as we can simply add or subtract 1.

下面是我的一些尝试(没有一个能回答我的问题)

Below are some of my attempts (none of which answer my question)

%%cython
from collections import defaultdict

def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]

def pir2(l):
    return [i for i in get_generator(l)]


def pir3(l):
    return [i for i in get_generator(l)]

def pir4(l):
    unq, inv = np.unique(l, 0, 1, 0)
    a = np.arange(len(unq))
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()

推荐答案

这是一种使用自定义分组范围创建函数和np.unique获取计数的矢量化方法-

Here's a vectorized approach using custom grouped range creating function and np.unique for getting the counts -

def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()

count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]

样品运行-

In [117]: A = list('aaabaaacaaadaaac')

In [118]: count = np.unique(A,return_counts=1)[1]
     ...: out = grp_range(count)[np.argsort(A).argsort()]
     ...: 

In [119]: out
Out[119]: array([ 0,  1,  2,  0,  3,  4,  5,  0,  6,  7,  8,  0,  9, 10, 11,  1])

要获得count,可以提出一些其他针对性能的替代方案-

For getting the count, few other alternatives could be proposed with focus on performance -

np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)

另外,如果A包含single-letter个字符,我们可以简单地使用-

Additionally, with A containing single-letter characters, we could get the count simply with -

np.bincount(np.array(A).view('uint8')-97)

这篇关于如何使用numpy在线性时间内按唯一值获取累计计数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆