Parallelize code that performs bitwise operations


Question

I have this code to make a 100k*200k matrix occupy less space by packing 8 elements of the AU matrix into each byte of A, which reduces memory consumption. As you would expect, the code takes forever to run, and I also plan to increase the number of rows to 200k. I am running it on a fairly powerful instance (CPU and GPU) that I can scale up. Can anyone help parallelize this code so that it runs faster?

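import time
import numpy as np

colm = 2000000 // 8   # number of packed bytes per row
rows = 1000000
cols = colm * 8
AU = np.random.randint(2, size=(rows, cols), dtype=np.int8)
start_time = time.time()

# Pack 8 consecutive elements of AU into one byte of A,
# with the first element stored in the highest bit.
A = np.empty((rows, colm), dtype=np.uint8)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i,j] = 0
        for k in range(8):
            if AU[i,(j*8)+k] == 1:
                A[i,j] = A[i,j] | (1<<(7-k))

print("elapsed seconds:", time.time() - start_time)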

Recommended answer

Warning: this code tries to allocate a huge amount of memory, about 2 TB, which you probably do not have.
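The 2 TB figure follows directly from the sizes in the question's code. As a quick check, here is a minimal sketch of the arithmetic; the helper name `matrix_bytes` is only illustrative and not part of the original code.

```python
def matrix_bytes(rows, cols, bits_per_element):
    """Return the storage size in bytes of a dense rows x cols matrix."""
    return rows * cols * bits_per_element // 8

rows, cols = 1_000_000, 2_000_000

unpacked = matrix_bytes(rows, cols, 8)  # AU: one int8 (8 bits) per element
packed = matrix_bytes(rows, cols, 1)    # A: one bit per element

print(f"AU: {unpacked / 1e12:.2f} TB")
print(f"A:  {packed / 1e9:.0f} GB")
```

With these sizes the unpacked input is the real problem: the packed output is 8 times smaller, but AU has to exist in memory before it can be packed.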

Assuming you have enough memory, or can reduce the size of the dataset, you can write a much faster implementation using the Numba JIT. You can also parallelize the code and, since AU only contains binary values, replace the slow conditional with a branchless implementation, which speeds up the computation significantly. Finally, you can unroll the inner loop over k to make the code even faster. Here is the resulting implementation:

import numpy as np
import numba as nb
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

compute(A, AU)
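If Numba is not available, NumPy's built-in `np.packbits` appears to produce the same layout, since its default bit order is big-endian (first element in the highest bit). This is an alternative to the Numba kernel above rather than part of the original answer. The sketch below checks it against the question's loop on a small array; the name `pack_rows_loop` and the test sizes are only illustrative.

```python
import numpy as np

def pack_rows_loop(AU):
    """Reference packing: 8 elements per byte, first element in the high bit."""
    rows, cols = AU.shape
    A = np.zeros((rows, cols // 8), dtype=np.uint8)
    for i in range(rows):
        for j in range(cols // 8):
            for k in range(8):
                if AU[i, j * 8 + k] == 1:
                    A[i, j] |= np.uint8(1 << (7 - k))
    return A

# Small binary test matrix (sizes chosen only for a quick check).
rng = np.random.default_rng(0)
AU_small = rng.integers(0, 2, size=(16, 64), dtype=np.int8)

A_loop = pack_rows_loop(AU_small)
A_packbits = np.packbits(AU_small, axis=1)  # default bitorder is 'big'

print(np.array_equal(A_loop, A_packbits))
```

`np.packbits` runs on a single thread, so the parallel Numba version may still be faster on a many-core machine; it is mainly useful as a dependency-free baseline and as a correctness check.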

On my machine, this code is 37851 times faster than the original implementation on a smaller dataset (with colm = int(20000/8) and rows = 10000). The original implementation took 6 min 3 s, while the optimized one took 9.6 ms.

This code is memory-bound on my machine. With the current inputs, it is close to optimal, since it spends most of its time reading the AU input matrix. A good additional optimization would be to "compress" the AU matrix into a smaller one, if possible.
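One possible reading of that suggestion is to never hold the full AU matrix in memory at all: produce it in row blocks and pack each block immediately. The sketch below assumes the data can be generated or loaded block by block and that the column count is a multiple of 8; `pack_in_blocks` and `block_rows` are hypothetical names, and the random generator only stands in for the real data source.

```python
import numpy as np

def pack_in_blocks(rows, cols, block_rows, rng):
    """Build the packed matrix block by block.

    Only block_rows unpacked rows exist in memory at any time.
    Assumes cols is a multiple of 8.
    """
    A = np.empty((rows, cols // 8), dtype=np.uint8)
    for start in range(0, rows, block_rows):
        stop = min(start + block_rows, rows)
        # Stand-in for loading or generating one block of the real data.
        block = rng.integers(0, 2, size=(stop - start, cols), dtype=np.int8)
        A[start:stop] = np.packbits(block, axis=1)
    return A

A = pack_in_blocks(rows=1000, cols=800, block_rows=128,
                   rng=np.random.default_rng(0))
print(A.shape, A.dtype)
```

With this layout the peak memory use is roughly the packed output plus one unpacked block, rather than the full unpacked matrix.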
