How to make numba @jit use all CPU cores (parallelize numba @jit)


Question

I am using numba's @jit decorator to add two numpy arrays in Python. The performance is much higher with @jit than with plain Python.

However, it is not utilizing all CPU cores, even when I pass @numba.jit(nopython=True, parallel=True, nogil=True).

Is there any way to make use of all CPU cores with numba @jit?

Here is my code:

import time
import numpy as np
import numba

SIZE = 2147483648 * 6

a = np.full(SIZE, 1, dtype=np.int32)
b = np.full(SIZE, 1, dtype=np.int32)
c = np.ndarray(SIZE, dtype=np.int32)

@numba.jit(nopython=True, parallel=True, nogil=True)
def add(a, b, c):
    for i in range(SIZE):
        c[i] = a[i] + b[i]

start = time.time()
add(a, b, c)
end = time.time()

print(end - start)

Answer

You can pass parallel=True to any numba-jitted function, but that doesn't mean it is always utilizing all cores. You have to understand that numba uses some heuristics to make the code execute in parallel, and sometimes these heuristics simply don't find anything to parallelize in the code. There is currently a pull request so that it issues a warning when it wasn't possible to make the code "parallel". So it is more of a "please make it execute in parallel if possible" parameter than an "enforce parallel execution" parameter.
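For a simple elementwise loop like the one in the question, you can point those heuristics at an explicit parallel loop with numba.prange (a minimal sketch; prange only takes effect together with parallel=True):

import numpy as np
import numba

@numba.jit(nopython=True, parallel=True, nogil=True)
def add(a, b, c):
    # prange marks this loop as explicitly parallel, so the heuristics
    # don't have to discover the parallelism on their own.
    for i in numba.prange(len(c)):
        c[i] = a[i] + b[i]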

However, you can always use threads or processes manually if you really know you can parallelize your code. Just adapting the multi-threading example from the numba docs:

#!/usr/bin/env python
from __future__ import print_function, division, absolute_import

import math
import threading
from timeit import repeat

import numpy as np
from numba import jit

nthreads = 4
size = 10**7  # CHANGED

# CHANGED
def func_np(a, b):
    """
    Control function using Numpy.
    """
    return a + b

# CHANGED
@jit('void(double[:], double[:], double[:])', nopython=True, nogil=True)
def inner_func_nb(result, a, b):
    """
    Function under test.
    """
    for i in range(len(result)):
        result[i] = a[i] + b[i]

def timefunc(correct, s, func, *args, **kwargs):
    """
    Benchmark *func* and print out its runtime.
    """
    print(s.ljust(20), end=" ")
    # Make sure the function is compiled before we start the benchmark
    res = func(*args, **kwargs)
    if correct is not None:
        assert np.allclose(res, correct), (res, correct)
    # time it
    print('{:>5.0f} ms'.format(min(repeat(lambda: func(*args, **kwargs),
                                          number=5, repeat=2)) * 1000))
    return res

def make_singlethread(inner_func):
    """
    Run the given function inside a single thread.
    """
    def func(*args):
        length = len(args[0])
        result = np.empty(length, dtype=np.float64)
        inner_func(result, *args)
        return result
    return func

def make_multithread(inner_func, numthreads):
    """
    Run the given function inside *numthreads* threads, splitting its
    arguments into equal-sized chunks.
    """
    def func_mt(*args):
        length = len(args[0])
        result = np.empty(length, dtype=np.float64)
        args = (result,) + args
        chunklen = (length + numthreads - 1) // numthreads
        # Create argument tuples for each input chunk
        chunks = [[arg[i * chunklen:(i + 1) * chunklen] for arg in args]
                  for i in range(numthreads)]
        # Spawn one thread per chunk
        threads = [threading.Thread(target=inner_func, args=chunk)
                   for chunk in chunks]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return result
    return func_mt


func_nb = make_singlethread(inner_func_nb)
func_nb_mt = make_multithread(inner_func_nb, nthreads)

a = np.random.rand(size)
b = np.random.rand(size)

correct = timefunc(None, "numpy (1 thread)", func_np, a, b)
timefunc(correct, "numba (1 thread)", func_nb, a, b)
timefunc(correct, "numba (%d threads)" % nthreads, func_nb_mt, a, b)

I highlighted the parts I changed (with # CHANGED comments); everything else was copied verbatim from the example. This utilizes all cores on my machine (a 4-core machine, therefore 4 threads) but doesn't show a significant speedup:

numpy (1 thread)       539 ms
numba (1 thread)       536 ms
numba (4 threads)      442 ms

The lack of (much) speedup with multithreading in this case is because addition is a bandwidth-limited operation. That means it takes much more time to load the elements from the arrays and place the result in the result array than to do the actual addition.
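A back-of-the-envelope calculation makes "bandwidth-limited" concrete for the benchmark above:

size = 10**7
bytes_moved = size * 8 * 3   # read a, read b, write result (float64)
print(bytes_moved / 1e6)     # 240.0 -> roughly 240 MB of memory traffic per call
# That traffic pays for only `size` additions: about 0.04 floating-point
# operations per byte moved, so memory access, not arithmetic, dominates.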

In these cases you might even see a slowdown because of the parallel execution!

Only if the functions are more complex, and the actual operation takes significant time compared to loading and storing array elements, will you see a big improvement with parallel execution. The example in the numba documentation is one like that:

def func_np(a, b):
    """
    Control function using Numpy.
    """
    return np.exp(2.1 * a + 3.2 * b)

@jit('void(double[:], double[:], double[:])', nopython=True, nogil=True)
def inner_func_nb(result, a, b):
    """
    Function under test.
    """
    for i in range(len(result)):
        result[i] = math.exp(2.1 * a[i] + 3.2 * b[i])

This actually scales (almost) linearly with the number of threads, because two multiplications, one addition, and one call to math.exp are much slower than loading and storing the results:

func_nb = make_singlethread(inner_func_nb)
func_nb_mt2 = make_multithread(inner_func_nb, 2)
func_nb_mt3 = make_multithread(inner_func_nb, 3)
func_nb_mt4 = make_multithread(inner_func_nb, 4)

a = np.random.rand(size)
b = np.random.rand(size)

correct = timefunc(None, "numpy (1 thread)", func_np, a, b)
timefunc(correct, "numba (1 thread)", func_nb, a, b)
timefunc(correct, "numba (2 threads)", func_nb_mt2, a, b)
timefunc(correct, "numba (3 threads)", func_nb_mt3, a, b)
timefunc(correct, "numba (4 threads)", func_nb_mt4, a, b)

Result:

numpy (1 thread)      3422 ms
numba (1 thread)      2959 ms
numba (2 threads)     1555 ms
numba (3 threads)     1080 ms
numba (4 threads)      797 ms
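As a final note, newer numba releases (0.49 and later; check your installed version) make the manual chunking above unnecessary for loops like this: prange handles the splitting, and the thread count can be set directly. A sketch under that assumption:

import math
import numpy as np
import numba

numba.set_num_threads(4)  # must not exceed numba.config.NUMBA_NUM_THREADS

@numba.njit(parallel=True, nogil=True)
def func_nb_prange(a, b):
    result = np.empty_like(a)
    for i in numba.prange(len(a)):
        result[i] = math.exp(2.1 * a[i] + 3.2 * b[i])
    return result

Alternatively, the NUMBA_NUM_THREADS environment variable caps the thread pool before the first parallel call.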
