Vectorized radix sort with numpy - can it beat np.sort?
Question

Numpy doesn't yet have a radix sort, so I wondered whether it was possible to write one using pre-existing numpy functions. So far I have the following, which does work, but is about 10 times slower than numpy's quicksort.
Test and benchmark:
import numpy as np

a = np.random.randint(0, 10**8, 10**6)
assert(np.all(radix_sort(a) == np.sort(a)))
%timeit np.sort(a)
%timeit radix_sort(a)
The mask_b loop can be at least partially vectorized, broadcasting out across masks from &, and using cumsum with an axis arg, but that ends up being a pessimization, presumably due to the increased memory footprint.

If anyone can see a way to improve on what I have I'd be interested to hear, even if it's still slower than np.sort ...this is more a case of intellectual curiosity and interest in numpy tricks.
Note that you can implement a fast counting sort easily enough, though that's only relevant for small integer data.
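For example, such a counting sort can be sketched with np.bincount and np.repeat (a minimal illustration, not code from the question; memory scales with the maximum value, so it only suits small integer ranges):

```python
import numpy as np

def counting_sort(a):
    # Tally every value with np.bincount, then expand the tallies
    # back into a sorted array with np.repeat.
    counts = np.bincount(a)
    return np.repeat(np.arange(len(counts)), counts)

print(counting_sort(np.array([3, 1, 4, 1, 5, 9, 2, 6])))
# [1 1 2 3 4 5 6 9]
```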
Edit 1: Taking np.arange(n) out of the loop helps a little, but that's not very exciting.
Edit 2: The cumsum was actually redundant (oops!) but this simpler version only helps marginally with performance:
def radix_sort(a):
    bit_len = np.max(a).bit_length()
    n = len(a)
    cached_arange = np.arange(n)
    idx = np.empty(n, dtype=int)  # fully overwritten each iteration
    for mask_b in range(bit_len):
        is_one = (a & 2**mask_b).astype(bool)
        n_ones = np.sum(is_one)
        n_zeros = n - n_ones
        idx[~is_one] = cached_arange[:n_zeros]
        idx[is_one] = cached_arange[:n_ones] + n_zeros
        # next three lines just do: a[idx] = a, but correctly
        new_a = np.empty(n, dtype=a.dtype)
        new_a[idx] = a
        a = new_a
    return a
Edit 3: Rather than loop over single bits, you can loop over two or more at a time, if you construct idx in multiple steps. Using 2 bits helps a little; I've not tried more:
idx[is_zero] = np.arange(n_zeros)
idx[is_one] = np.arange(n_ones)
idx[is_two] = np.arange(n_twos)
idx[is_three] = np.arange(n_threes)
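A full 2-bits-per-pass version of that idea might look like the following (a hedged sketch; radix_sort_2bit is an illustrative name, and the inner digit loop replaces the four explicit masks above):

```python
import numpy as np

def radix_sort_2bit(a):
    # Each pass buckets on one base-4 digit, building idx in four
    # steps as described above, then scatters as in the 1-bit sort.
    bit_len = int(np.max(a)).bit_length()
    n = len(a)
    idx = np.empty(n, dtype=np.intp)
    for shift in range(0, bit_len, 2):
        digit = (a >> shift) & 3
        offset = 0
        for d in range(4):
            sel = digit == d
            cnt = int(np.sum(sel))
            idx[sel] = np.arange(cnt) + offset  # stable within each digit
            offset += cnt
        new_a = np.empty_like(a)
        new_a[idx] = a
        a = new_a
    return a
```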
Edits 4 and 5: going to 4 bits seems best for the input I'm testing. Also, you can get rid of the idx step entirely. Now only about 5 times, rather than 10 times, slower than np.sort (source available as gist):
Edit 6: This is a tidied up version of the above, but it's also a tiny bit slower. 80% of the time is spent on repeat and extract - if only there was a way to broadcast the extract :( ...
def radix_sort(a, batch_m_bits=3):
    bit_len = np.max(a).bit_length()
    batch_m = 2**batch_m_bits
    mask = 2**batch_m_bits - 1
    nax = np.newaxis
    val_set = np.arange(batch_m, dtype=a.dtype)[:, nax]
    for _ in range((bit_len-1)//batch_m_bits + 1):  # ceil-division
        a = np.extract((a & mask)[nax, :] == val_set,
                       np.repeat(a[nax, :], batch_m, axis=0))
        val_set <<= batch_m_bits
        mask <<= batch_m_bits
    return a
Edits 7 & 8: Actually, you can broadcast the extract using as_strided from numpy.lib.stride_tricks, but it doesn't seem to help much performance-wise:
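The zero-stride trick can be sketched like this (an illustration under assumptions; repeat_view is a hypothetical helper, and as noted it buys little in practice):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def repeat_view(a, batch_m):
    # Present 1-D `a` as a (batch_m, len(a)) array without copying,
    # by giving the leading axis a zero stride; np.extract then sees
    # the same values np.repeat would have materialized.
    return as_strided(a, shape=(batch_m, a.size),
                      strides=(0, a.strides[0]))
```

np.extract(cond, repeat_view(a, batch_m)) is then a drop-in for the np.repeat call above.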
Initially this made sense to me on the grounds that extract will be iterating over the whole array batch_m times, so the total number of cache lines requested by the CPU will be the same as before (it's just that by the end of the process it has requested each cache line batch_m times). However the reality is that extract is not clever enough to iterate over arbitrary strided arrays, and has to expand out the array before beginning, i.e. the repeat ends up being done anyway.
In fact, having looked at the source for extract, I now see that the best we can do with this approach is:

a = a[np.flatnonzero((a & mask)[nax, :] == val_set) % len(a)]

which is marginally slower than extract. However, if len(a) is a power of two we can replace the expensive mod operation with & (len(a) - 1), which does end up being a bit faster than the extract version (now about 4.9x np.sort for a=randint(0, 1e8, 2**20)). I suppose we could make this work for non-power-of-two lengths by zero-padding, and then cropping the extra zeros at the end of the sort...however this would be a pessimization unless the length was already close to a power of two.
Answer

I had a go with Numba to see how fast a radix sort could be. The key to good performance with Numba (often) is to write out all the loops, which is very instructive. I ended up with the following:
from numba import jit

@jit
def radix_loop(nbatches, batch_m_bits, bitsums, a, out):
    mask = (1 << batch_m_bits) - 1
    for shift in range(0, nbatches*batch_m_bits, batch_m_bits):
        # set bit sums to zero
        for i in range(bitsums.shape[0]):
            bitsums[i] = 0
        # determine bit sums
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            bitsums[j] += 1
        # take the cumsum of the bit sums
        cumsum = 0
        for i in range(bitsums.shape[0]):
            temp = bitsums[i]
            bitsums[i] = cumsum
            cumsum += temp
        # sorting loop
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            out[bitsums[j]] = a[i]
            bitsums[j] += 1
        # prepare next iteration
        mask <<= batch_m_bits
        # can't use `temp` here because of numba internal types
        temp2 = a
        a = out
        out = temp2
    return a
From the 4 inner loops, it's easy to see it's the 4th one making it hard to vectorize with Numpy.

One way to cheat around that problem is to pull in a particular C++ function from Scipy: scipy.sparse.coo.coo_tocsr. It does pretty much the same inner loops as the Python function above, so it can be abused to write a faster "vectorized" radix sort in Python. Maybe something like:

from scipy.sparse.coo import coo_tocsr
def radix_step(radix, keys, bitsums, a, w):
    coo_tocsr(radix, 1, a.size, keys, a, a, bitsums, w, w)
    return w, a

def scipysparse_radix_perbyte(a):
    # coo_tocsr internally works with system int and upcasts
    # anything else. We need to copy anyway to not mess with
    # original array. Also take into account endianness...
    a = a.astype('<i', copy=True)
    bitlen = int(a.max()).bit_length()
    radix = 256
    work = np.empty_like(a)
    _ = np.empty(radix+1, int)
    for i in range((bitlen-1)//8 + 1):
        keys = a.view('u1')[i::a.itemsize].astype(int)
        a, work = radix_step(radix, keys, _, a, work)
    return a

EDIT: Optimized the function a little bit.. see edit history.

One inefficiency of LSB radix sorting like above is that the array is completely shuffled in RAM a number of times, which means the CPU cache isn't used very well. To try to mitigate this effect, one could opt to first do a pass with MSB radix sort, to put items in roughly the right block of RAM, before sorting every resulting group with a LSB radix sort. Here's one implementation:
def scipysparse_radix_hybrid(a, bbits=8, gbits=8):
    """
    Parameters
    ----------
    a : Array of non-negative integers to be sorted.
    bbits : Number of bits in radix for LSB sorting.
    gbits : Number of bits in radix for MSB grouping.
    """
    a = a.copy()
    bitlen = int(a.max()).bit_length()
    work = np.empty_like(a)
    # Group values by single iteration of MSB radix sort:
    # Casting to np.int_ to get rid of python BigInt
    ngroups = np.int_(2**gbits)
    group_offset = np.empty(ngroups + 1, int)
    shift = max(bitlen-gbits, 0)
    a, work = radix_step(ngroups, a>>shift, group_offset, a, work)
    bitlen = shift
    if not bitlen:
        return a
    # LSB radix sort each group:
    agroups = np.split(a, group_offset[1:-1])
    # Mask off high bits to not undo the grouping..
    gmask = (1 << shift) - 1
    nbatch = (bitlen-1) // bbits + 1
    radix = np.int_(2**bbits)
    _ = np.empty(radix + 1, int)
    for agi in agroups:
        if not agi.size:
            continue
        mask = (radix - 1) & gmask
        wgi = work[:agi.size]
        for shift in range(0, nbatch*bbits, bbits):
            keys = (agi & mask) >> shift
            agi, wgi = radix_step(radix, keys, _, agi, wgi)
            mask = (mask << bbits) & gmask
        if nbatch % 2:
            # Copy result back in to `a`
            wgi[...] = agi
    return a
def numba_radix(a, batch_m_bits=8):
    a = a.copy()
    bit_len = int(a.max()).bit_length()
    nbatches = (bit_len-1)//batch_m_bits + 1
    work = np.zeros_like(a)
    bitsums = np.zeros(2**batch_m_bits + 1, int)
    srtd = radix_loop(nbatches, batch_m_bits, bitsums, a, work)
    return srtd

Timings (with best performing settings for each on my system):
a = np.random.randint(0, 10**8, 10**6)
%timeit numba_radix(a, 9)
# 10 loops, best of 3: 76.1 ms per loop
%timeit np.sort(a)
#10 loops, best of 3: 115 ms per loop
%timeit scipysparse_radix_perbyte(a)
#10 loops, best of 3: 95.2 ms per loop
%timeit scipysparse_radix_hybrid(a, 11, 6)
#10 loops, best of 3: 75.4 ms per loop
Numba performs very well, as expected. And also with some clever application of existing C-extensions it's possible to beat numpy.sort. IMO at the level of optimization you've already gotten, it's worth it to also consider add-ons to Numpy, but I wouldn't really consider the implementations in my answer "vectorized": the bulk of the work is done in an external dedicated function.

One other thing that strikes me is the sensitivity to the choice of radix. For most of the settings I tried my implementations were still slower than numpy.sort, so in practice some sort of heuristic would be required to offer good performance across the board.
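One way such a heuristic could work is to time a small sample at each candidate radix width and pick the winner (purely a sketch; pick_radix and the candidate list are made up for illustration, with sort_fn standing in for e.g. numba_radix above):

```python
import time
import numpy as np

def pick_radix(sort_fn, a, candidates=(6, 8, 9, 11)):
    # Hypothetical heuristic: run sort_fn on a small sample of `a`
    # at each candidate radix width and return the fastest choice.
    sample = a[:min(len(a), 1 << 15)]
    timings = {}
    for bits in candidates:
        start = time.perf_counter()
        sort_fn(sample.copy(), bits)
        timings[bits] = time.perf_counter() - start
    return min(timings, key=timings.get)
```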