How to speed up my numpy loop using numpy.where()

Question

I recently wrote a function for an ordered logit model, but it takes a long time to run on large data sets. So I want to rewrite the code and replace the if statements with numpy.where. I have run into a problem with my new code and don't know how to solve it. If you know how, please help me. Thank you very much!

This is my original function:

import numpy as np
from scipy.stats import logistic

def func(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
           ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
           ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
           for i in range(1, len(thresholds)):
               if row[0] == i:
                   diff_prob = logistic.cdf(thresholds[i] - row[1]) - logistic.cdf(thresholds[i - 1] - row[1])
                   if diff_prob <= 10 ** -5:
                       ll += np.log(10 ** -5)
                   else:
                       ll += np.log(diff_prob)
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func(y, X, thresholds))

This is my new, but still imperfect, code:

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
          np.where(y == len(thresholds), logistic.logcdf(X - thresholds[-1]),
                   np.log(logistic.cdf(thresholds[1] - X) - logistic.cdf(thresholds[0] - X))))
print(ll.sum())

The problem is that I don't know how to rewrite the inner loop (the for loop over i) in the same way.

Recommended answer

I think asking how to implement it just using np.where is a bit of an X/Y problem.

So I'll try to explain how I would approach optimizing this function.

My first instinct is to get rid of the inner for loop, which was the pain point anyway:

import numpy as np
from scipy.stats import logistic

def func1(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            diff_prob = logistic.cdf(thresholds[row[0]] - row[1]) - \
                         logistic.cdf(thresholds[row[0] - 1] - row[1])
            diff_prob = 10 ** -5 if diff_prob < 10 ** -5 else diff_prob
            ll += np.log(diff_prob)
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func1(y, X, thresholds))

I have just replaced i with row[0], without changing the semantics of the loop. So that's one fewer for loop.

Now I would like the statements in the different branches of the if-else to have the same form. To that end:

import numpy as np
from scipy.stats import logistic

def func2(y, X, thresholds):
    ll = 0.0

    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            ll += np.log(
                np.maximum(
                    10 ** -5, 
                    logistic.cdf(thresholds[row[0]] - row[1]) -
                     logistic.cdf(thresholds[row[0] - 1] - row[1])
                )
            )
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func2(y, X, thresholds))

Now the expression in each branch is of the form ll += expr.

At this point there are a couple of different paths the optimization can take. You can try to optimize the loop away by writing it as a comprehension, but I suspect that it will not give you much of a speed-up.
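As a rough sketch of the comprehension route (the names row_loglik and func2_comprehension are made up for this illustration and are not part of the original code), the body of func2 can be moved into a per-row helper and summed with a generator expression. It still calls scipy once per row in Python, which is why little speed-up should be expected:

```python
import numpy as np
from scipy.stats import logistic

def row_loglik(y_i, x_i, thresholds):
    # Hypothetical helper: log-likelihood contribution of one observation,
    # using the same three branches as func2.
    if y_i == 0:
        return logistic.logcdf(thresholds[0] - x_i)
    if y_i == len(thresholds):
        return logistic.logcdf(x_i - thresholds[-1])
    diff_prob = (logistic.cdf(thresholds[y_i] - x_i) -
                 logistic.cdf(thresholds[y_i - 1] - x_i))
    return np.log(np.maximum(10 ** -5, diff_prob))

def func2_comprehension(y, X, thresholds):
    # Still a Python-level loop, only written as a generator expression.
    return sum(row_loglik(y_i, x_i, thresholds) for y_i, x_i in zip(y, X))

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func2_comprehension(y, X, thresholds))
```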

An alternative path is to pull the if conditions out of the loop. That was your intent with np.where as well:

import numpy as np
from scipy.stats import logistic

def func3(y, X, thresholds):
    y_0 = y == 0
    y_end = y == len(thresholds)
    y_rest = ~(y_0 | y_end)

    ll_1 = logistic.logcdf(thresholds[0] - X[ y_0 ])
    ll_2 = logistic.logcdf(X[ y_end ] - thresholds[-1])
    ll_3 = np.log(
        np.maximum(
            10 ** -5, 
            logistic.cdf(thresholds[y[ y_rest ]] - X[ y_rest ]) -
              logistic.cdf(thresholds[ y[y_rest] - 1 ] - X[ y_rest])
        )
    )
    return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)

y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func3(y, X, thresholds))

Note that I turned X into an np.array to be able to use fancy indexing on it.
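To make the indexing concrete, here is a small illustration with made-up values (the sample data is an assumption, chosen only to show the mechanics). A boolean mask selects rows, and an integer array used as an index looks up one threshold per selected row:

```python
import numpy as np

y = np.array([0, 1, 2, 1])
X = np.array([2.0, 2.5, 3.0, 3.5])
thresholds = np.array([2.0, 3.0])

# Boolean mask: True for the "middle" categories (neither 0 nor len(thresholds)).
y_rest = (y != 0) & (y != len(thresholds))

x_rest = X[y_rest]                 # boolean-mask indexing keeps the matching rows
upper = thresholds[y[y_rest]]      # integer-array indexing: upper cut point per row
lower = thresholds[y[y_rest] - 1]  # lower cut point per row
print(x_rest, upper, lower)

# A plain Python list does not support indexing with an array mask.
try:
    [2, 2, 2, 2][y_rest]
except TypeError as exc:
    print("list indexing failed:", exc)
```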

At this point, I'd wager that it is fast enough for my purposes. However, you can stop earlier or go beyond this point, depending on your requirements.
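For completeness, the same idea can also be written with nested np.where, as the question originally asked. This is only a sketch under my own assumptions (the name func_where is hypothetical, and the index clipping is a device I added so the middle-category expression can be evaluated for every row). Note that np.where computes all three candidate arrays in full before selecting, so it does somewhat more work than the masked version:

```python
import numpy as np
from scipy.stats import logistic

def func_where(y, X, thresholds):
    y = np.asarray(y)
    X = np.asarray(X, dtype=float)
    n = len(thresholds)

    # Clip the indices so the middle-category expression stays in bounds for
    # every row; np.where later discards the values for y == 0 and y == n.
    upper = thresholds[np.minimum(y, n - 1)]
    lower = thresholds[np.maximum(y - 1, 0)]
    middle = np.log(np.maximum(10 ** -5,
                               logistic.cdf(upper - X) - logistic.cdf(lower - X)))

    ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
                  np.where(y == n, logistic.logcdf(X - thresholds[-1]), middle))
    return ll.sum()

y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func_where(y, X, thresholds))
```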

On my computer, I get the following results:

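# np.random.random_integers is deprecated; randint excludes the upper bound,
# so 11 here gives the same range of values, 0 to 10 inclusive.
y = np.random.randint(0, 11, size=(10000,))
X = np.random.randint(0, 11, size=(10000,))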
thresholds = np.cumsum(np.random.rand(10))

%timeit func(y, X, thresholds) # Original
1 loops, best of 3: 1.51 s per loop

%timeit func1(y, X, thresholds) # Removed for-loop
1 loops, best of 3: 1.46 s per loop

%timeit func2(y, X, thresholds) # Standardized if statements
1 loops, best of 3: 1.5 s per loop

%timeit func3(y, X, thresholds) # Vectorized ~ 500x improvement
100 loops, best of 3: 2.74 ms per loop
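Outside IPython, a comparison along these lines can be reproduced with the standard timeit module. The sketch below is an assumption-laden stand-in rather than the exact benchmark above: loglik_loop and loglik_vectorized are hypothetical names that restate the logic of func2 and func3, the sample is smaller, and the seed is arbitrary. It also checks that both versions agree before timing them:

```python
import timeit
import numpy as np
from scipy.stats import logistic

def loglik_loop(y, X, thresholds):
    # Row-by-row reference with the same logic as func2.
    ll = 0.0
    for y_i, x_i in zip(y, X):
        if y_i == 0:
            ll += logistic.logcdf(thresholds[0] - x_i)
        elif y_i == len(thresholds):
            ll += logistic.logcdf(x_i - thresholds[-1])
        else:
            ll += np.log(np.maximum(10 ** -5,
                                    logistic.cdf(thresholds[y_i] - x_i) -
                                    logistic.cdf(thresholds[y_i - 1] - x_i)))
    return ll

def loglik_vectorized(y, X, thresholds):
    # Mask-based version with the same logic as func3.
    y_0 = y == 0
    y_end = y == len(thresholds)
    y_rest = ~(y_0 | y_end)
    ll_1 = logistic.logcdf(thresholds[0] - X[y_0])
    ll_2 = logistic.logcdf(X[y_end] - thresholds[-1])
    ll_3 = np.log(np.maximum(10 ** -5,
                             logistic.cdf(thresholds[y[y_rest]] - X[y_rest]) -
                             logistic.cdf(thresholds[y[y_rest] - 1] - X[y_rest])))
    return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)

rng = np.random.default_rng(0)           # arbitrary seed, for repeatability
y = rng.integers(0, 11, size=1000)       # categories 0..10
X = rng.integers(0, 11, size=1000)
thresholds = np.cumsum(rng.random(10))   # 10 increasing cut points

# Both versions should give the same log-likelihood up to rounding.
assert np.isclose(loglik_loop(y, X, thresholds), loglik_vectorized(y, X, thresholds))

t_loop = timeit.timeit(lambda: loglik_loop(y, X, thresholds), number=1)
t_vec = timeit.timeit(lambda: loglik_vectorized(y, X, thresholds), number=10) / 10
print("loop: %.4f s, vectorized: %.6f s" % (t_loop, t_vec))
```

Absolute timings depend on the machine and library versions, so the numbers printed will differ from those quoted above.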
