Fast logarithm calculation


Problem Description


All the code was run on the same machine on linux.

In python:

import numpy as np
drr = abs(np.random.randn(100000,50))
%timeit np.log2(drr)

10 loops, best of 3: 77.9 ms per loop

In C++ (compiled with g++ -o log ./log.cpp -std=c++11 -O3):

#include <iostream>
#include <cmath>    // std::log2, std::abs -- missing in the original listing
#include <random>
#include <ctime>

int main()
{
  std::mt19937 e2(0);
  std::normal_distribution<> dist(0, 1);
  const int n_seq = 100000;
  const int l_seq = 50;
  static double x[n_seq][l_seq];
  for (int n = 0; n < n_seq; ++n) {
    for (int k = 0; k < l_seq; ++k) {
      x[n][k] = std::abs(dist(e2));  // unqualified abs() could resolve to the int overload
      if (x[n][k] <= 0)
        x[n][k] = 0.1;
    }
  }
  clock_t begin = clock();

  for (int n = 0; n < n_seq; ++n) {
    for (int k = 0; k < l_seq; ++k) {
      x[n][k] = std::log2(x[n][k]);
    }
  }
  clock_t end = clock();
  std::cout << 1000.0 * (end - begin) / CLOCKS_PER_SEC << " ms\n";
  return 0;
}

Runs in 60 ms

In MATLAB:

abr = abs(randn(100000,50));
tic;abr=log2(abr);toc

Elapsed time is 7.8 ms.

I can understand the speed difference between C++ and numpy, but MATLAB beats everything. I've come across http://fastapprox.googlecode.com/svn/trunk/fastapprox/src/fastonebigheader.h, but it only handles float, not double, and I'm not sure how to convert it to double.

I also tried this: http://hackage.haskell.org/package/approximate-0.2.2.1/src/cbits/fast.c which has fast log functions, and when compiled as a numpy ufunc, runs in 20 ms, which is great, but the loss in accuracy is significant.
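For reference, the bit trick those fast-log implementations rely on carries over to double precision directly: reinterpret the IEEE-754 bits as an integer, and the scaled bit pattern is already a crude log2. A minimal numpy sketch (my own illustration, not code from either link; it assumes positive, finite float64 input):

import numpy as np

def fast_log2_approx(x):
    # View the float64 bits as int64: bits / 2**52 equals
    # biased_exponent + mantissa_fraction for positive finite x.
    bits = np.ascontiguousarray(x, dtype=np.float64).view(np.int64)
    # Subtract the exponent bias (1023). This approximates
    # log2(1 + m) by m, giving roughly 0.086 max absolute error.
    return bits * (1.0 / (1 << 52)) - 1023.0

This avoids the transcendental call entirely, but with the same kind of accuracy loss as the fast.c version above; a polynomial correction on the mantissa term is what the fastapprox code adds on top.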

Any ideas on how to achieve the magical log2 speed that MATLAB gets?

UPDATE

Thank you all for the comments, they were very fast and very helpful! Indeed, the answer is parallelisation, i.e. spreading the load over several threads. Following @morningsun's suggestion,

%timeit numexpr.evaluate('log(drr)')

gives 5.6 ms, which is on par with MATLAB, thank you! (numexpr is MKL-enabled.)
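For completeness, a self-contained version of that call (assuming a numexpr build that exposes log but not log2 in its expression language, hence the rebase afterwards):

import numpy as np
import numexpr

drr = np.abs(np.random.randn(100000, 50))
res_ln = numexpr.evaluate('log(drr)')      # natural log, multithreaded by numexpr
res_log2 = res_ln * (1.0 / np.log(2.0))    # rebase to log2 if that is what you need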

Solution

Note that ALL of the code below uses float32, not double precision.

UPDATE: I've ditched gcc completely in favour of Intel's icc. It makes ALL the difference when performance is critical and you don't have time to fine-tune your "compiler hints" to enforce gcc vectorization (see, e.g., here).

log_omp.c:

GCC: gcc -o log_omp.so -fopenmp log_omp.c -lm -O3 -fPIC -shared -std=c99

ICC: icc -o log_omp.so -openmp log_omp.c -lm -O3 -fPIC -shared -std=c99 -vec-report1 -xAVX -I/opt/intel/composer/mkl/include

#include <math.h>
#include "omp.h"
#include "mkl_vml.h"

#define restrict __restrict

inline void log_omp(int m, float * restrict a, float * restrict c);

void log_omp(int m, float * restrict a, float * restrict c)
{
   int i;
#pragma omp parallel for default(none) shared(m,a,c) private(i)
   for (i=0; i<m; i++) {
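      // log() is the double-precision natural log; logf() would avoid
      // the float<->double round trip on these float32 buffers.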
      a[i] = log(c[i]);
   }
}

// VML / icc only:
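// Note: vsLog10 below is the base-10 MKL routine, while log_omp above
// takes the natural log; rescale the result (or use the vmsLn variant
// commented out below) if a consistent base matters.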
void log_VML(int m, float * restrict a, float * restrict c)
{
   int i;
   int split_to = 14;
   int iter = m / split_to;
   int additional = m % split_to;

//   vsLn(m, c, a);
#pragma omp parallel for default(none) shared(m,a,c, additional, iter) private(i) num_threads(split_to)
   for (i=0;i < (m-additional); i+=iter)
     vsLog10(iter,c+i,a+i);
     //vmsLn(iter,c+i,a+i, VML_HA);

   if (additional > 0)
     vsLog10(additional, c+m-additional, a+m-additional);
     //vmsLn(additional, c+m-additional, a+m-additional, VML_HA);
}

In Python:

import numpy as np
from ctypes import CDLL, c_int, c_void_p

lib = CDLL('./log_omp.so')   # load the shared library once, not per call
lib.log_omp.argtypes = [c_int,
                        np.ctypeslib.ndpointer(dtype=np.float32),
                        np.ctypeslib.ndpointer(dtype=np.float32)]
lib.log_omp.restype = c_void_p

def log_omp(xs, out=None):
    n = xs.shape[0]
    if out is None:          # only allocate when the caller gave no buffer
        out = np.empty(n, np.float32)
    lib.log_omp(n, out, xs)
    return out
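A quick usage sketch (assuming log_omp.so was built with one of the compile lines above):

x = np.abs(np.random.randn(1000000).astype(np.float32))
y = log_omp(x)   # natural log of every element, spread across OpenMP threads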

Cython code (in an IPython notebook, hence the %% magic):

%%cython --compile-args=-fopenmp --link-args=-fopenmp
import numpy as np
cimport numpy as np
from libc.math cimport log

from cython.parallel cimport prange
import cython

@cython.boundscheck(False)
def cylog(np.ndarray[np.float32_t, ndim=1] a not None,
        np.ndarray[np.float32_t, ndim=1] out=None):
    if out is None:
        out = np.empty((a.shape[0]), dtype=a.dtype)
    cdef Py_ssize_t i
    with nogil:
        for i in prange(a.shape[0]):
            out[i] = log(a[i])
    return out

Timings:

numexpr.detect_number_of_cores() // 2
28

%env OMP_NUM_THREADS=28
x = np.abs(np.random.randn(50000000).astype('float32'))
y = x.copy()

# GCC
%timeit log_omp(x, y)
10 loops, best of 3: 21.6 ms per loop
# ICC
%timeit log_omp(x, y)
100 loops, best of 3: 9.6 ms per loop
%timeit log_VML(x, y)
100 loops, best of 3: 10 ms per loop

%timeit cylog(x, out=y)
10 loops, best of 3: 21.7 ms per loop

numexpr.set_num_threads(28)
%timeit out = numexpr.evaluate('log(x)')
100 loops, best of 3: 13 ms per loop

So numexpr seems to be doing a better job than (poorly) compiled gcc code, but icc wins.

Some resources I found useful and shamelessly borrowed code from:

http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html

https://gist.github.com/zed/2051661
