将数组排序到索引数组指定的bin中的最有效方法? [英] Most efficient way to sort an array into bins specified by an index array?

查看:79
本文介绍了将数组排序到索引数组指定的bin中的最有效方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

示例任务:

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx  = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])

预期结果:

binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])

约束:

  • 应该很快.

  • Should be fast.

应为O(n+k),其中n是数据长度,k是容器数.

Should be O(n+k) where n is the length of data and k is the number of bins.

应该稳定,即保留箱内的顺序.

Should be stable, i.e. order within bins is preserved.

明显的解决方案

data[np.argsort(idx, kind='stable')]

O(n log n).

O(n+k)解决方案

def sort_to_bins(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 1, int)
    for i in range(idx.size):
        cnts[idx[i] + 1] += 1
    for i in range(1, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(data)
    for i in range(data.size):
        res[cnts[idx[i]]] = data[i]
        cnts[idx[i]] += 1
    return res

是循环且缓慢的.

在纯numpy中是否有更好的方法? scipy< pandas< numba/pythran?

Is there a better method in pure numpy < scipy < pandas < numba/pythran?

推荐答案

以下是一些解决方案:

  1. 仍然使用np.argsort,毕竟它是快速编译的代码.

  1. Use np.argsort anyway, after all it is fast compiled code.

使用np.bincount获取纸槽大小,使用np.argpartition固定数量的纸槽为O(n).缺点:目前尚无稳定的算法,因此我们必须对每个bin进行排序.

Use np.bincount to get the bin sizes and np.argpartition which is O(n) for fixed number of bins. Downside: currently, no stable algorithm is available, thus we have to sort each bin.

使用scipy.ndimage.measurements.labeled_comprehension.这大致完成了所需的操作,但不知道如何实现.

Use scipy.ndimage.measurements.labeled_comprehension. This does roughly what is required, but no idea how it is implemented.

使用pandas.我是一个完整的pandas菜鸟,所以我在这里用groupby拼凑的东西可能不太理想.

Use pandas. I'm a complete pandas noob, so what I cobbled together here using groupby may be suboptimal.

在压缩的稀疏行格式和压缩的稀疏列格式之间使用scipy.sparse切换恰好实现了我们正在寻找的确切操作.

Use scipy.sparse switching between compressed sparse row and compressed sparse column formats happens to implement the exact operation we are looking for.

在问题中的循环代码上使用pythran(我相信numba也可以使用).只需在numpy导入后插入顶部

Use pythran (I'm sure numba works as well) on the loopy code in the question. All that is required is to insert at the top after numpy import

.

#pythran export sort_to_bins(int[:], float[:], int)

然后编译

# pythran stb_pthr.py

基准测试100 bins,可变项目数:

Benchmarks 100 bins, variable number of items:

带回家:

如果您对numba/pythran没问题,那是可以走的路,如果没有,scipy.sparse的缩放比例会很好.

If you are ok with numba/pythran that is the way to go, if not scipy.sparse scales rather well.

代码:

import numpy as np
from scipy import sparse
from scipy.ndimage.measurements import labeled_comprehension
from stb_pthr import sort_to_bins as sort_to_bins_pythran
import pandas as pd

def sort_to_bins_pandas(idx, data, mx=-1):
    df = pd.DataFrame.from_dict(data=data)
    out = np.empty_like(data)
    j = 0
    for grp in df.groupby(idx).groups.values():
        out[j:j+len(grp)] = data[np.sort(grp)]
        j += len(grp)
    return out

def sort_to_bins_ndimage(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    out = np.empty_like(data)
    j = 0
    def collect(bin):
        nonlocal j
        out[j:j+len(bin)] = np.sort(bin)
        j += len(bin)
        return 0
    labeled_comprehension(data, idx, np.arange(mx), collect, data.dtype, None)
    return out

def sort_to_bins_partition(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    return data[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]

def sort_to_bins_partition_stable(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    split = np.bincount(idx, None, mx)[:-1].cumsum()
    srt = np.argpartition(idx, split)
    for bin in np.split(srt, split):
        bin.sort()
    return data[srt]

def sort_to_bins_sparse(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1    
    return sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().data

def sort_to_bins_argsort(idx, data, mx=-1):
    return data[idx.argsort(kind='stable')]

from timeit import timeit
exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<<np.arange(5, 21)]]

timings = {}
for idx in exmpls:
    data = np.arange(len(idx), dtype=float)
    ref = None
    for x, f in (*globals().items(),):
        if x.startswith('sort_to_bins_'):
            timings.setdefault(x.replace('sort_to_bins_', '').replace('_', ' '), []).append(timeit('f(idx, data, -1)', globals={'f':f, 'idx':idx, 'data':data}, number=10)*100)
            if x=='sort_to_bins_partition':
                continue
            if ref is None:
                ref = f(idx, data, -1)
            else:
                assert np.all(f(idx, data, -1)==ref)

import pylab
for k, v in timings.items():
    pylab.loglog(1<<np.arange(5, 21), v, label=k)
pylab.xlabel('#items')
pylab.ylabel('time [ms]')
pylab.legend()
pylab.show()

这篇关于将数组排序到索引数组指定的bin中的最有效方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆