快速对数计算 [英] Fast logarithm calculation
问题描述
所有代码都在linux上的同一台计算机上运行。
在python中:
import numpy as np
drr = abs(np.random.randn(100000,50))
%timeit np.log2(drr)
10个循环,最好的3:每个循环77.9 ms
在C ++使用g ++ -o log ./log.cpp -std = c ++ 11 -O3编译):
#include< iostream>
#include< iomanip>
#include< string>
#include< map>
#include< random>
#include< ctime>
int main()
{
std :: mt19937 e2(0);
std :: normal_distribution<> dist(0,1);
const int n_seq = 100000;
const int l_seq = 50;
static double x [n_seq] [l_seq];
for(int n = 0; nfor(int k = 0; k x [n] k] = abs(dist(e2));
if(x [n] [k] <= 0)
x [n] [k] = 0.1;
}
}
clock_t begin = clock();
for(int n = 0; n< n_seq; ++ n){
for(int k = 0; kx [n] [k] = std :: log2(x [n] [k]);
}
}
clock_t end = clock();
以60 ms运行
:
abr = abs(randn(100000,50));
tic; abr = log2(abr); toc
已用时间为7.8 ms。 / p>
我可以理解C ++和numpy之间的速度差异,但MATLAB打败了一切。
我遇到了
http://fastapprox.googlecode.com /svn/trunk/fastapprox/src/fastonebigheader.h
但是这只是float,不是double,我不知道如何将它转换为double。
我也尝试过:
http: //hackage.haskell.org/package/approximate-0.2.2.1/src/cbits/fast.c
它具有快速日志功能,编译为numpy ufunc时运行时间为20 ms,这是伟大的,但精度的损失是显着的。
有关如何实现MATLAB神奇的log2速度的任何想法?
UPDATE
感谢大家的意见,这非常快速,非常有帮助!实际上,答案是并行化,即在几个线程上分散负载。根据@morningsun建议,
%timeit numexpr.evaluate('log(drr)')
ms,这是与MATLAB一样,谢谢! numexpr is MKL enabled
解决方案请注意,以下所有都是float32,而不是双精度。
UPDATE :
我完全弃用了gcc以支持英特尔的icc。当性能至关重要,当你没有时间微调你的编译器提示来强制执行gcc向量化时,这一切都会有所不同(参见例如 here )
log_omp.c ,
GCC:gcc -o log_omp.so -fopenmp log_omp.c -lm -O3 -fPIC -shared -std = c99
ICC:icc -o log_omp.so -openmp loge_omp.c -lm -O3 -fPIC -shared -std = c99 -vec-report1 -xAVX -I / opt / intel / composer / mkl / include
#include< math.h>
#includeomp.h
#includemkl_vml.h
#define restrict __restrict
inline void log_omp * restrict a,float * restrict c);
void log_omp(int m,float * restrict a,float * restrict c)
{
int i;
#pragma omp parallel for default(none)shared(m,a,c)private(i)
for(i = 0; ia [i] log(c [i]);
}
}
// VML / icc only:
void log_VML(int m,float * restrict a,float * restrict c)
{
int i;
int split_to = 14;
int iter = m / split_to;
int additional = m%split_to;
// vsLn(m,c,a);
#pragma omp parallel for default(none)shared(m,a,c,additional,iter)private(i)num_threads(split_to)
for(i = 0; i < ); i + = iter)
vsLog10(iter,c + i,a + i)
// vmsLn(iter,c + i,a + i,VML_HA);
if(additional> 0)
vsLog10(附加,c + m附加,a + m附加);
// vmsLn(additional,c + m-additional,a + m-additional,VML_HA);
}
in python:
$ b来自ctypes import CDLL,c_int,c_void_p
def log_omp(xs,out):
lib = CDLL('./ log_omp .so')
lib.log_omp.argtypes = [c_int,np.ctypeslib.ndpointer(dtype = np.float32),np.ctypeslib.ndpointer(dtype = np.float32)]
lib.log_omp .restype = c_void_p
n = xs.shape [0]
out = np.empty(n,np.float32)
lib.log_omp(n,out,xs)
return out
Cython代码(在ipython notebook中,因此是%% magic):
%% cython --compile-args = -fopenmp --link-args = -fopenmp
import numpy as np
cimport numpy as np
从libc.math cimport log
从cython.parallel cimport prange
import cython
@cython .boundscheck(False)
def cylog(np.ndarray [np.float32_t,ndim = 1] a not None,
np.ndarray [np.float32_t,ndim = 1] out = None):
如果out为None:
out = np.empty((a.shape [0]),dtype = a.dtype)
cdef Py_ssize_t i
with nogil:
for i in prange(a.shape [0]):
out [i] = log(a [i])
return out
计时:
numexpr。 detect_number_of_cores()// 2
pre>
28
%env OMP_NUM_THREADS = 28
x = np.abs(np.random.randn(50000000).astype('float32'))
y = x.copy()
#GCC
%timeit log_omp(x,y)
10循环,最好的3:每循环21.6 ms
#ICC
%timeit log_omp(x,y)
100循环,最好是3:每循环9.6 ms
%timeit log_VML(x,y)
100循环3:每个循环10 ms
%timeit cylog(x,out = y)
10个循环,最好的3:每个循环21.7 ms
numexpr.set_num_threads (28)
%timeit out = numexpr.evaluate('log(x)')
100个循环,最好的3:每个循环13 ms
所以numexpr似乎比gcc编译好,但是icc胜过。
一些资源我发现有用和可耻使用的代码:
http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html
< a href =https://gist.github.com/zed/2051661 =nofollow> https://gist.github.com/zed/2051661
All the code was run on the same machine on linux.
In python:
import numpy as np drr = abs(np.random.randn(100000,50)) %timeit np.log2(drr)
10 loops, best of 3: 77.9 ms per loop
In C++ (compiled with g++ -o log ./log.cpp -std=c++11 -O3):
#include <iostream> #include <iomanip> #include <string> #include <map> #include <random> #include <ctime> int main() { std::mt19937 e2(0); std::normal_distribution<> dist(0, 1); const int n_seq = 100000; const int l_seq = 50; static double x[n_seq][l_seq]; for (int n = 0;n < n_seq; ++n) { for (int k = 0; k < l_seq; ++k) { x[n][k] = abs(dist(e2)); if(x[n][k] <= 0) x[n][k] = 0.1; } } clock_t begin = clock(); for (int n = 0; n < n_seq; ++n) { for (int k = 0; k < l_seq; ++k) { x[n][k] = std::log2(x[n][k]); } } clock_t end = clock();
Runs in 60 ms
In MATLAB:
abr = abs(randn(100000,50)); tic;abr=log2(abr);toc
Elapsed time is 7.8 ms.
I can understand the speed difference between C++ and numpy, but MATLAB beats everything. I've come across http://fastapprox.googlecode.com/svn/trunk/fastapprox/src/fastonebigheader.h but this does only float, not double, and I'm not sure how to convert it to double.
I also tried this: http://hackage.haskell.org/package/approximate-0.2.2.1/src/cbits/fast.c which has fast log functions, and when compiled as a numpy ufunc, runs in 20 ms, which is great, but the loss in accuracy is significant.
Any ideas on how to achieve the magical log2 speed that MATLAB gets?
UPDATE
Thank you all for comments, that was very fast and very helpful! Indeed, the answer is parallelisation, i.e. spreading the load on several threads. Following @morningsun suggestion,
%timeit numexpr.evaluate('log(drr)')
gives 5.6 ms, which is on par with MATLAB, thank you! numexpr is MKL enabled
解决方案Note that ALL below are float32, not double precision.
UPDATE: I've ditched gcc completely in favour of Intel's icc. It makes ALL the difference when performance is critical and when you don't have time to fine-tune your "compiler hints" to enforce gcc vectorization (see, e.g. here)
log_omp.c,
GCC: gcc -o log_omp.so -fopenmp log_omp.c -lm -O3 -fPIC -shared -std=c99
ICC: icc -o log_omp.so -openmp loge_omp.c -lm -O3 -fPIC -shared -std=c99 -vec-report1 -xAVX -I/opt/intel/composer/mkl/include
#include <math.h> #include "omp.h" #include "mkl_vml.h" #define restrict __restrict inline void log_omp(int m, float * restrict a, float * restrict c); void log_omp(int m, float * restrict a, float * restrict c) { int i; #pragma omp parallel for default(none) shared(m,a,c) private(i) for (i=0; i<m; i++) { a[i] = log(c[i]); } } // VML / icc only: void log_VML(int m, float * restrict a, float * restrict c) { int i; int split_to = 14; int iter = m / split_to; int additional = m % split_to; // vsLn(m, c, a); #pragma omp parallel for default(none) shared(m,a,c, additional, iter) private(i) num_threads(split_to) for (i=0;i < (m-additional); i+=iter) vsLog10(iter,c+i,a+i); //vmsLn(iter,c+i,a+i, VML_HA); if (additional > 0) vsLog10(additional, c+m-additional, a+m-additional); //vmsLn(additional, c+m-additional, a+m-additional, VML_HA); }
in python:
from ctypes import CDLL, c_int, c_void_p def log_omp(xs, out): lib = CDLL('./log_omp.so') lib.log_omp.argtypes = [c_int, np.ctypeslib.ndpointer(dtype=np.float32), np.ctypeslib.ndpointer(dtype=np.float32)] lib.log_omp.restype = c_void_p n = xs.shape[0] out = np.empty(n, np.float32) lib.log_omp(n, out, xs) return out
Cython code (in ipython notebook, hence the %% magic):
%%cython --compile-args=-fopenmp --link-args=-fopenmp import numpy as np cimport numpy as np from libc.math cimport log from cython.parallel cimport prange import cython @cython.boundscheck(False) def cylog(np.ndarray[np.float32_t, ndim=1] a not None, np.ndarray[np.float32_t, ndim=1] out=None): if out is None: out = np.empty((a.shape[0]), dtype=a.dtype) cdef Py_ssize_t i with nogil: for i in prange(a.shape[0]): out[i] = log(a[i]) return out
Timings:
numexpr.detect_number_of_cores() // 2 28 %env OMP_NUM_THREADS=28 x = np.abs(np.random.randn(50000000).astype('float32')) y = x.copy() # GCC %timeit log_omp(x, y) 10 loops, best of 3: 21.6 ms per loop # ICC %timeit log_omp(x, y) 100 loops, best of 3: 9.6 ms per loop %timeit log_VML(x, y) 100 loops, best of 3: 10 ms per loop %timeit cylog(x, out=y) 10 loops, best of 3: 21.7 ms per loop numexpr.set_num_threads(28) %timeit out = numexpr.evaluate('log(x)') 100 loops, best of 3: 13 ms per loop
So numexpr seems to be doing a better job than (poorly) compiled gcc code, but icc wins.
Some resources I found useful and shamefully used code from:
http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html
https://gist.github.com/zed/2051661
这篇关于快速对数计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!