Speeding up matrix-vector multiplication and exponentiation in Python, possibly by calling C/C++


Question

I am currently working on a machine learning project where - given a data matrix Z and a vector rho - I have to compute the value and slope of the logistic loss function at rho. The computation involves basic matrix-vector multiplication and log/exp operations, with a trick to avoid numerical overflow (described in this previous post).

I am currently doing this in Python using NumPy as shown below (as a reference, this code runs in 0.2s). Although this works well, I would like to speed it up since I call the function multiple times in my code (and it represents over 90% of the computation involved in my project).

I am looking for any way to improve the runtime of this code without parallelization (i.e. only 1 CPU). I am happy to use any publicly available Python package, or to call C or C++ (since I have heard that this can improve runtimes by an order of magnitude). Preprocessing the data matrix Z would also be OK. Two properties that could be exploited for better performance are that the vector rho is usually sparse (around 50% of its entries are 0) and that there are usually far more rows than columns (in most cases n_cols <= 100).


import time
import numpy as np

np.__config__.show() #make sure BLAS/LAPACK is being used
np.random.seed(seed = 0)

#initialize data matrix X and label vector Y
n_rows, n_cols = int(1e6), 100 #shape arguments must be integers, so cast 1e6
X = np.random.random(size=(n_rows, n_cols))
Y = np.random.randint(low=0, high=2, size=(n_rows, 1))
Y[Y==0] = -1
Z = X*Y # all operations are carried out on Z

def compute_logistic_loss_value_and_slope(rho, Z):
    #compute the value and slope of the logistic loss function in a way that is numerically stable
    #loss_value: (1 x 1) scalar = 1/n_rows * sum(log(1 + exp(-Z*rho)))
    #loss_slope: (n_cols x 1) vector = 1/n_rows * Z'*(1/(1 + exp(-Z*rho)) - 1)
    #see also: http://stackoverflow.com/questions/20085768/

    scores = Z.dot(rho)
    pos_idx = scores > 0
    exp_scores_pos = np.exp(-scores[pos_idx])
    exp_scores_neg = np.exp(scores[~pos_idx])

    #compute loss value
    loss_value = np.empty_like(scores)
    loss_value[pos_idx] = np.log(1.0 + exp_scores_pos)
    loss_value[~pos_idx] = -scores[~pos_idx] + np.log(1.0 + exp_scores_neg)
    loss_value = loss_value.mean()

    #compute loss slope
    phi_slope = np.empty_like(scores)
    phi_slope[pos_idx]  = 1.0 / (1.0 + exp_scores_pos)
    phi_slope[~pos_idx] = exp_scores_neg / (1.0 + exp_scores_neg)
    loss_slope = Z.T.dot(phi_slope - 1.0) / Z.shape[0]

    return loss_value, loss_slope


#initialize a vector of integers where more than half of the entries = 0
rho_test = np.random.randint(low=-10, high=10, size=(n_cols, 1))
set_to_zero = np.random.choice(range(0, n_cols), size=(n_cols // 2, 1), replace=False) #size entries must be integers
rho_test[set_to_zero] = 0.0

start_time = time.time()
loss_value, loss_slope = compute_logistic_loss_value_and_slope(rho_test, Z)
print "total runtime = %1.5f seconds" % (time.time() - start_time)

Solution

Libraries of the BLAS family are already highly tuned for best performance, so linking to your own C/C++ code is unlikely to yield any benefit. You could, however, try various BLAS implementations, since there are quite a few of them around, including some specially tuned for certain CPUs.
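
If the NumPy call overhead itself matters, the BLAS routines can also be called directly through SciPy's wrappers. A sketch with small stand-in data (the shapes here are illustrative, not those from the question); note that dgemv wants a Fortran-ordered matrix to avoid an internal copy, so the memory layout of Z matters:

import numpy as np
from scipy.linalg.blas import dgemv

Z = np.asfortranarray(np.random.random(size=(1000, 100))) #Fortran order avoids a copy inside dgemv
rho = np.random.random(size=100)
scores = dgemv(1.0, Z, rho) #equivalent to Z.dot(rho), minus some NumPy dispatch overhead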

The other thing that comes to mind is to use a library like theano (or Google's tensorflow) that can represent the entire computational graph (all of the operations in your function above) and apply global optimizations to it. It can then generate CPU code from that graph via C++ (and, by flipping a simple switch, GPU code as well). It can also compute symbolic derivatives for you automatically. I've used theano for machine learning problems and it's a really great library for that, although not the easiest one to learn.

(I'm posting this as an answer because it's too long for a comment)

Edit:

I actually had a go at this in theano, but the result was about 2x slower on the CPU; see the comments below for why. I'll post it here anyway; maybe it's a starting point for someone else to do something better. (This is only partial code; complete it with the code from the original post.)

import theano

def make_graph(rho, Z):
    scores = theano.tensor.dot(Z, rho)

    # this is very inefficient... it calculates everything twice and
    # then picks one of them depending on scores being positive or not.
    # not sure how to express this in theano in a more efficient way
    pos = theano.tensor.log(1 + theano.tensor.exp(-scores))
    neg = -scores + theano.tensor.log(1 + theano.tensor.exp(scores)) # mirrors the stable branch in the NumPy code
    loss_value = theano.tensor.switch(scores > 0, pos, neg)
    loss_value = loss_value.mean()

    # however computing the derivative is a real joy now:
    loss_slope = theano.tensor.grad(loss_value, rho)

    return loss_value, loss_slope

sym_rho = theano.tensor.col('rho')
sym_Z = theano.tensor.matrix('Z')
sym_loss_value, sym_loss_slope = make_graph(sym_rho, sym_Z)

compute_logistic_loss_value_and_slope = theano.function(
        inputs=[sym_rho, sym_Z],
        outputs=[sym_loss_value, sym_loss_slope]
        )

# use function compute_logistic_loss_value_and_slope() as in original code
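
On the "calculates everything twice" concern in the comments above: theano ships a softplus op whose implementation handles the extreme input regimes itself, so the explicit switch may be avoidable entirely. An untested sketch of the same graph (make_graph_softplus is a hypothetical variant, not from the original answer):

import theano

def make_graph_softplus(rho, Z):
    scores = theano.tensor.dot(Z, rho)
    #softplus(x) = log(1 + exp(x)); theano's op computes it stably,
    #so no manual positive/negative branching is needed
    loss_value = theano.tensor.nnet.softplus(-scores).mean()
    loss_slope = theano.tensor.grad(loss_value, rho)
    return loss_value, loss_slope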
