将数组排序到索引数组指定的bin中的最有效方法? [英] Most efficient way to sort an array into bins specified by an index array?
问题描述
示例任务:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])
预期结果:
binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])
约束:
-
应该很快.
Should be fast.
应为O(n+k)
,其中n是数据长度,k是容器数.
Should be O(n+k)
where n is the length of data and k is the number of bins.
应该稳定,即保留箱内的顺序.
Should be stable, i.e. order within bins is preserved.
明显的解决方案
data[np.argsort(idx, kind='stable')]
是O(n log n)
.
O(n+k)
解决方案
def sort_to_bins(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
cnts = np.zeros(mx + 1, int)
for i in range(idx.size):
cnts[idx[i] + 1] += 1
for i in range(1, cnts.size):
cnts[i] += cnts[i-1]
res = np.empty_like(data)
for i in range(data.size):
res[cnts[idx[i]]] = data[i]
cnts[idx[i]] += 1
return res
是循环且缓慢的.
在纯numpy
中是否有更好的方法? scipy
< pandas
< numba
/pythran
?
Is there a better method in pure numpy
< scipy
< pandas
< numba
/pythran
?
推荐答案
以下是一些解决方案:
-
仍然使用
np.argsort
,毕竟它是快速编译的代码.
Use
np.argsort
anyway, after all it is fast compiled code.
使用np.bincount
获取纸槽大小,使用np.argpartition
固定数量的纸槽为O(n)
.缺点:目前尚无稳定的算法,因此我们必须对每个bin进行排序.
Use np.bincount
to get the bin sizes and np.argpartition
which is O(n)
for fixed number of bins. Downside: currently, no stable algorithm is available, thus we have to sort each bin.
使用scipy.ndimage.measurements.labeled_comprehension
.这大致完成了所需的操作,但不知道如何实现.
Use scipy.ndimage.measurements.labeled_comprehension
. This does roughly what is required, but no idea how it is implemented.
使用pandas
.我是一个完整的pandas
菜鸟,所以我在这里用groupby
拼凑的东西可能不太理想.
Use pandas
. I'm a complete pandas
noob, so what I cobbled together here using groupby
may be suboptimal.
在压缩的稀疏行格式和压缩的稀疏列格式之间使用scipy.sparse
切换恰好实现了我们正在寻找的确切操作.
Use scipy.sparse
switching between compressed sparse row and compressed sparse column formats happens to implement the exact operation we are looking for.
在问题中的循环代码上使用pythran
(我相信numba
也可以使用).只需在numpy导入后插入顶部
Use pythran
(I'm sure numba
works as well) on the loopy code in the question. All that is required is to insert at the top after numpy import
.
#pythran export sort_to_bins(int[:], float[:], int)
然后编译
# pythran stb_pthr.py
基准测试100 bins,可变项目数:
Benchmarks 100 bins, variable number of items:
带回家:
如果您对numba
/pythran
没问题,那是可以走的路,如果没有,scipy.sparse
的缩放比例会很好.
If you are ok with numba
/pythran
that is the way to go, if not scipy.sparse
scales rather well.
代码:
import numpy as np
from scipy import sparse
from scipy.ndimage.measurements import labeled_comprehension
from stb_pthr import sort_to_bins as sort_to_bins_pythran
import pandas as pd
def sort_to_bins_pandas(idx, data, mx=-1):
df = pd.DataFrame.from_dict(data=data)
out = np.empty_like(data)
j = 0
for grp in df.groupby(idx).groups.values():
out[j:j+len(grp)] = data[np.sort(grp)]
j += len(grp)
return out
def sort_to_bins_ndimage(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
out = np.empty_like(data)
j = 0
def collect(bin):
nonlocal j
out[j:j+len(bin)] = np.sort(bin)
j += len(bin)
return 0
labeled_comprehension(data, idx, np.arange(mx), collect, data.dtype, None)
return out
def sort_to_bins_partition(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
return data[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]
def sort_to_bins_partition_stable(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
split = np.bincount(idx, None, mx)[:-1].cumsum()
srt = np.argpartition(idx, split)
for bin in np.split(srt, split):
bin.sort()
return data[srt]
def sort_to_bins_sparse(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
return sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().data
def sort_to_bins_argsort(idx, data, mx=-1):
return data[idx.argsort(kind='stable')]
from timeit import timeit
exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<<np.arange(5, 21)]]
timings = {}
for idx in exmpls:
data = np.arange(len(idx), dtype=float)
ref = None
for x, f in (*globals().items(),):
if x.startswith('sort_to_bins_'):
timings.setdefault(x.replace('sort_to_bins_', '').replace('_', ' '), []).append(timeit('f(idx, data, -1)', globals={'f':f, 'idx':idx, 'data':data}, number=10)*100)
if x=='sort_to_bins_partition':
continue
if ref is None:
ref = f(idx, data, -1)
else:
assert np.all(f(idx, data, -1)==ref)
import pylab
for k, v in timings.items():
pylab.loglog(1<<np.arange(5, 21), v, label=k)
pylab.xlabel('#items')
pylab.ylabel('time [ms]')
pylab.legend()
pylab.show()
这篇关于将数组排序到索引数组指定的bin中的最有效方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!