将数组排序到由索引数组指定的 bin 的最有效方法? [英] Most efficient way to sort an array into bins specified by an index array?

查看:9
本文介绍了将数组排序到由索引数组指定的 bin 的最有效方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

任务示例:

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])

预期结果:

binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])

约束:

  • 应该很快.

  • 应该是 O(n+k) 其中 n 是数据的长度,k 是 bin 的数量.

  • 应该是稳定的,即保持 bin 内的顺序.

明显的解决方案

data[np.argsort(idx, kind='stable')]

O(n log n).

O(n+k)

def sort_to_bins(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1cnts = np.zeros(mx + 1, int)对于我在范围内(idx.size):cnts[idx[i] + 1] += 1对于范围内的 i(1,cnts.size):cnts[i] += cnts[i-1]res = np.empty_like(数据)对于我在范围内(数据大小):res[cnts[idx[i]]] = 数据[i]cnts[idx[i]] += 1返回资源

又乱又慢.

numpy有没有更好的方法<scipy 熊猫 numba/pythran?

解决方案

这里有几个解决方案:

  1. 无论如何使用np.argsort,毕竟它是快速编译的代码.

  2. 使用 np.bincount 获取 bin 大小和 np.argpartitionO(n) 对于固定数量的垃圾箱.缺点:目前没有稳定的算法可用,因此我们必须对每个 bin 进行排序.

  3. 使用 scipy.ndimage.measurements.labeled_comprehension.这大致满足了要求,但不知道它是如何实现的.

  4. 使用pandas.我是一个完整的 pandas 菜鸟,所以我在这里使用 groupby 拼凑的东西可能不是最理想的.

  5. 使用scipy.sparse在压缩稀疏行和压缩稀疏列格式之间切换恰好实现了我们正在寻找的确切操作.

  6. 在问题中的循环代码上使用 pythran(我确定 numba 也能正常工作).所需要做的就是在 numpy 导入后在顶部插入

.

#pythran export sort_to_bins(int[:], float[:], int)

然后编译

# pythran stb_pthr.py

基准 100 个箱子,可变数量的项目:

带回家:

如果你对 numba/pythran 没问题,那就是要走的路,如果不是,scipy.sparse 可以很好地扩展.>

代码:

将 numpy 导入为 np从 scipy 导入稀疏从 scipy.ndimage.measurements 导入labeled_comprehension从 stb_pthr 导入 sort_to_bins 作为 sort_to_bins_pythran将熊猫导入为 pddef sort_to_bins_pandas(idx, data, mx=-1):df = pd.DataFrame.from_dict(data=data)out = np.empty_like(data)j = 0对于 df.groupby(idx).groups.values() 中的 grp:out[j:j+len(grp)] = 数据[np.sort(grp)]j += len(grp)回来def sort_to_bins_ndimage(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1out = np.empty_like(data)j = 0定义收集(bin):非局部 jout[j:j+len(bin)] = np.sort(bin)j += len(bin)返回 0标记理解(数据,idx,np.arange(mx),收集,data.dtype,无)回来def sort_to_bins_partition(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1返回数据[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]def sort_to_bins_partition_stable(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1split = np.bincount(idx, None, mx)[:-1].cumsum()srt = np.argpartition(idx, split)对于 np.split(srt, split) 中的 bin:bin.sort()返回数据[srt]def sort_to_bins_sparse(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1返回 sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().datadef sort_to_bins_argsort(idx, data, mx=-1):返回数据[idx.argsort(kind='stable')]从时间导入时间exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<

The task by example:

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx  = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])

Expected result:

binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])

Constraints:

  • Should be fast.

  • Should be O(n+k) where n is the length of data and k is the number of bins.

  • Should be stable, i.e. order within bins is preserved.

Obvious solution

data[np.argsort(idx, kind='stable')]

is O(n log n).

O(n+k) solution

def sort_to_bins(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 1, int)
    for i in range(idx.size):
        cnts[idx[i] + 1] += 1
    for i in range(1, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(data)
    for i in range(data.size):
        res[cnts[idx[i]]] = data[i]
        cnts[idx[i]] += 1
    return res

is loopy and slow.

Is there a better method in pure numpy < scipy < pandas < numba/pythran?

解决方案

Here are a few solutions:

  1. Use np.argsort anyway, after all it is fast compiled code.

  2. Use np.bincount to get the bin sizes and np.argpartition which is O(n) for fixed number of bins. Downside: currently, no stable algorithm is available, thus we have to sort each bin.

  3. Use scipy.ndimage.measurements.labeled_comprehension. This does roughly what is required, but no idea how it is implemented.

  4. Use pandas. I'm a complete pandas noob, so what I cobbled together here using groupby may be suboptimal.

  5. Use scipy.sparse switching between compressed sparse row and compressed sparse column formats happens to implement the exact operation we are looking for.

  6. Use pythran (I'm sure numba works as well) on the loopy code in the question. All that is required is to insert at the top after numpy import

.

#pythran export sort_to_bins(int[:], float[:], int)

and then compile

# pythran stb_pthr.py

Benchmarks 100 bins, variable number of items:

Take home:

If you are ok with numba/pythran that is the way to go, if not scipy.sparse scales rather well.

Code:

import numpy as np
from scipy import sparse
from scipy.ndimage.measurements import labeled_comprehension
from stb_pthr import sort_to_bins as sort_to_bins_pythran
import pandas as pd

def sort_to_bins_pandas(idx, data, mx=-1):
    df = pd.DataFrame.from_dict(data=data)
    out = np.empty_like(data)
    j = 0
    for grp in df.groupby(idx).groups.values():
        out[j:j+len(grp)] = data[np.sort(grp)]
        j += len(grp)
    return out

def sort_to_bins_ndimage(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    out = np.empty_like(data)
    j = 0
    def collect(bin):
        nonlocal j
        out[j:j+len(bin)] = np.sort(bin)
        j += len(bin)
        return 0
    labeled_comprehension(data, idx, np.arange(mx), collect, data.dtype, None)
    return out

def sort_to_bins_partition(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    return data[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]

def sort_to_bins_partition_stable(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    split = np.bincount(idx, None, mx)[:-1].cumsum()
    srt = np.argpartition(idx, split)
    for bin in np.split(srt, split):
        bin.sort()
    return data[srt]

def sort_to_bins_sparse(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1    
    return sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().data

def sort_to_bins_argsort(idx, data, mx=-1):
    return data[idx.argsort(kind='stable')]

from timeit import timeit
exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<<np.arange(5, 21)]]

timings = {}
for idx in exmpls:
    data = np.arange(len(idx), dtype=float)
    ref = None
    for x, f in (*globals().items(),):
        if x.startswith('sort_to_bins_'):
            timings.setdefault(x.replace('sort_to_bins_', '').replace('_', ' '), []).append(timeit('f(idx, data, -1)', globals={'f':f, 'idx':idx, 'data':data}, number=10)*100)
            if x=='sort_to_bins_partition':
                continue
            if ref is None:
                ref = f(idx, data, -1)
            else:
                assert np.all(f(idx, data, -1)==ref)

import pylab
for k, v in timings.items():
    pylab.loglog(1<<np.arange(5, 21), v, label=k)
pylab.xlabel('#items')
pylab.ylabel('time [ms]')
pylab.legend()
pylab.show()

这篇关于将数组排序到由索引数组指定的 bin 的最有效方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆