NumPy-按频率快速稳定地对大型数组进行arg排序 [英] NumPy - fast stable arg-sort of large array by frequency

查看:75
本文介绍了NumPy-按频率快速稳定地对大型数组进行arg排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有任何类似的dtype的大型1D NumPy 数组a,它的某些元素可能是重复.

I have large 1D NumPy array a of any comparable dtype, some of its elements may be repeated.

我如何找到可以稳定排序的排序索引ix(稳定性此处描述的意义)a是按值的降序还是升序?

How do I find sorting indexes ix that will stable-sort (stability in a sense described here) a by frequencies of values in descending/ascending orders?

我想找到最快,最简单的方法.也许现有的标准numpy函数可以做到这一点.

I want to find fastest and simplest way to do this. Maybe there is existing standard numpy function to do that.

这里还有另一个相关的问题,但是专门要求删除数组重复项,即仅输出唯一的排序值,我需要原始数组的所有值,包括重复项.

There is another related question here but it was asking specifically to remove arrays duplicates, i.e. output only unique sorted values, I need all values of original array including duplicates.

我已经编写了第一个试用版来完成该任务,但是它不是最快的(使用Python的循环),并且可能不是最短/最简单的形式.如果相等元素的重复次数不高且数组很大,则此python循环可能会非常昂贵.如果在NumPy中可用(例如,虚构的np.argsort_by_freq()),则具有简短的功能可以很好地完成这一切也是很好的选择.

I've coded my first trial to do the task, but it is not the fastest (uses Python's loop) and probably not shortest/simplest possible form. This python loop can be very expensive if repeating of equal elements is not high and array is huge. Also would be nice to have short function for doing this all if available in NumPy (e.g. imaginary np.argsort_by_freq()).

在线试用!

import numpy as np
np.random.seed(1)
hi, n, desc = 7, 24, True
a = np.random.choice(np.arange(hi), (n,), p = (
    lambda p = np.random.random((hi,)): p / p.sum()
)())
us, cs = np.unique(a, return_counts = True)
af = np.zeros(n, dtype = np.int64)
for u, c in zip(us, cs):
    af[a == u] = c
if desc:
    ix = np.argsort(-af, kind = 'stable') # Descending sort
else:
    ix = np.argsort(af, kind = 'stable') # Ascending sort
print('rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)')
print('    / sorted_freqs(4) / sorting_ix(5)')
print(np.stack((
    np.arange(n), a, af, a[ix], af[ix], ix,
), 0))

输出:

rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)
    / sorted_freqs(4) / sorting_ix(5)
[[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
 [ 1  1  1  1  3  0  5  0  3  1  1  0  0  4  6  1  3  5  5  0  0  0  5  0]
 [ 7  7  7  7  3  8  4  8  3  7  7  8  8  1  1  7  3  4  4  8  8  8  4  8]
 [ 0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  5  5  5  5  3  3  3  4  6]
 [ 8  8  8  8  8  8  8  8  7  7  7  7  7  7  7  4  4  4  4  3  3  3  1  1]
 [ 5  7 11 12 19 20 21 23  0  1  2  3  9 10 15  6 17 18 22  4  8 16 13 14]]

推荐答案

我只是想出自己很简单的解决方案,只需使用numpy函数即可实现任何dtype,而无需使用python循环,它可以在O(N log N)时间内运行.使用的numpy函数:np.uniquenp.argsort和数组索引.

I just figured myself probably very fast solution for any dtype using just numpy functions without python looping, it works in O(N log N) time. Used numpy functions: np.unique, np.argsort and array indexing.

尽管在原始问题中没有问过,但我实现了额外的标志equal_order_by_val,如果它为False,则将具有相同频率的数组元素排序为相等的稳定范围,这意味着可能存在c d d c d c输出,如下面的输出转储中所示,因为这是元素以相同频率进入原始数组的顺序.当flag为True时,此类元素还按原始数组的值排序,结果为c c c d d d.换句话说,在为False的情况下,我们仅按键freq进行稳定排序,当为True时,我们按(freq, value)进行升序排序,并按(-freq, value)进行降序排序.

Although wasn't asked in original question, I implemented extra flag equal_order_by_val if it is False then array elements with same frequencies are sorted as equal stable range, meaning that there could be c d d c d c output like in outputs dumps below, because this is the order as elements go in original array for equal frequency. When flag is True such elements are in addition sorted by value of original array, resulting in c c c d d d. In other words in case of False we sort stably just by key freq, and when it is True we sort by (freq, value) for ascending order and by (-freq, value) for descending order.

在线试玩

import string, math
import numpy as np
np.random.seed(0)

# Generating input data

hi, n, desc = 7, 25, True
letters = np.array(list(string.ascii_letters), dtype = np.object_)[:hi]
a = np.random.choice(letters, (n,), p = (
    lambda p = np.random.random((letters.size,)): p / p.sum()
)())

for equal_order_by_val in [False, True]:
    # Solving task

    us, ui, cs = np.unique(a, return_inverse = True, return_counts = True)
    af = cs[ui]
    sort_key = -af if desc else af
    if equal_order_by_val:
        shift_bits = max(1, math.ceil(math.log(us.size) / math.log(2)))
        sort_key = ((sort_key.astype(np.int64) << shift_bits) +
            np.arange(us.size, dtype = np.int64)[ui])
    ix = np.argsort(sort_key, kind = 'stable') # Do sorting itself

    # Printing results

    print('\nequal_order_by_val:', equal_order_by_val)
    for name, val in [
        ('i_col', np.arange(n)),  ('original_a', a),
        ('freqs', af),            ('sorted_a', a[ix]),
        ('sorted_freqs', af[ix]), ('sorting_ix', ix),
    ]:
        print(name.rjust(12), ' '.join([str(e).rjust(2) for e in val]))

输出:

equal_order_by_val: False
       i_col  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  original_a  g  g  c  f  d  d  g  a  a  a  f  f  f  g  f  c  f  a  e  b  g  d  c  b  f
       freqs  5  5  3  7  3  3  5  4  4  4  7  7  7  5  7  3  7  4  1  2  5  3  3  2  7
    sorted_a  f  f  f  f  f  f  f  g  g  g  g  g  a  a  a  a  c  d  d  c  d  c  b  b  e
sorted_freqs  7  7  7  7  7  7  7  5  5  5  5  5  4  4  4  4  3  3  3  3  3  3  2  2  1
  sorting_ix  3 10 11 12 14 16 24  0  1  6 13 20  7  8  9 17  2  4  5 15 21 22 19 23 18

equal_order_by_val: True
       i_col  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  original_a  g  g  c  f  d  d  g  a  a  a  f  f  f  g  f  c  f  a  e  b  g  d  c  b  f
       freqs  5  5  3  7  3  3  5  4  4  4  7  7  7  5  7  3  7  4  1  2  5  3  3  2  7
    sorted_a  f  f  f  f  f  f  f  g  g  g  g  g  a  a  a  a  c  c  c  d  d  d  b  b  e
sorted_freqs  7  7  7  7  7  7  7  5  5  5  5  5  4  4  4  4  3  3  3  3  3  3  2  2  1
  sorting_ix  3 10 11 12 14 16 24  0  1  6 13 20  7  8  9 17  2 15 22  4  5 21 19 23 18

这篇关于NumPy-按频率快速稳定地对大型数组进行arg排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆