Optimized method for calculating cosine distance in Python

Views: 742 | Published: 2016/6/1 19:53:39 | Tags: python, arrays, optimization, distance


Question

I wrote a method to calculate the cosine distance between two arrays:

from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        return False
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):
        numerator += a[i]*b[i]
        denoma += abs(a[i])**2
        denomb += abs(b[i])**2
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result

Running it can be very slow on a large array. Is there an optimized version of this method that would run faster?

Update: I've tried all the suggestions to date, including scipy. Here's the version to beat, incorporating suggestions from Mike and Steve:

from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError("a and b must be same length")  #Steve
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):       #Mike's optimizations:
        ai = a[i]             #only calculate once
        bi = b[i]
        numerator += ai*bi    #faster than exponent (barely)
        denoma += ai*ai       #strip abs() since it's squaring
        denomb += bi*bi
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result

Solution

If you can use SciPy, you can use cosine from spatial.distance:

http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
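For illustration, here is a minimal sketch of that call, written so it still runs when SciPy is not installed. scipy.spatial.distance.cosine is the real SciPy function. The wrapper cosine_distance, the helper _pure_python_cosine, and the fallback arrangement are hypothetical choices made for this example, not part of the original answer.

```python
from math import sqrt

try:
    # Real SciPy function: scipy.spatial.distance.cosine(u, v)
    from scipy.spatial.distance import cosine as _scipy_cosine
except ImportError:
    _scipy_cosine = None  # SciPy is not installed; use the fallback below


def _pure_python_cosine(a, b):
    """Plain-loop cosine distance, same arithmetic as the question's code."""
    numerator = denoma = denomb = 0.0
    for ai, bi in zip(a, b):
        numerator += ai * bi
        denoma += ai * ai
        denomb += bi * bi
    return 1 - numerator / (sqrt(denoma) * sqrt(denomb))


def cosine_distance(a, b):
    """Cosine distance via SciPy when available, else the pure-Python loop."""
    if len(a) != len(b):
        raise ValueError("a and b must be same length")
    if _scipy_cosine is not None:
        return float(_scipy_cosine(a, b))
    return _pure_python_cosine(a, b)


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # parallel vectors, close to 0
```

Neither path handles an all-zero vector specially, so check for that case separately if your data can contain one.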

If you can't use SciPy, you could try to obtain a small speedup by rewriting your Python (EDIT: but it didn't work out like I thought it would, see below).

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy
from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError("a and b must be same length")
    numerator = sum(tup[0] * tup[1] for tup in izip(a,b))
    denoma = sum(avalue ** 2 for avalue in a)
    denomb = sum(bvalue ** 2 for bvalue in b)
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result

It is better to raise an exception when the lengths of a and b are mismatched.
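A small sketch of why the exception is safer. The names distance_or_false, distance_or_raise, and _distance are made up for this example. The point is that False compares equal to 0 in Python, so a length mismatch reported as False can pass for a distance of 0, which means "same direction".

```python
from math import sqrt


def _distance(a, b):
    # Same arithmetic as the functions in this thread.
    numerator = sum(x * y for x, y in zip(a, b))
    denoma = sum(x * x for x in a)
    denomb = sum(y * y for y in b)
    return 1 - numerator / (sqrt(denoma) * sqrt(denomb))


def distance_or_false(a, b):
    # Mismatch signalled by a return value, as in the question's first version.
    if len(a) != len(b):
        return False
    return _distance(a, b)


def distance_or_raise(a, b):
    # Mismatch signalled by an exception.
    if len(a) != len(b):
        raise ValueError("a and b must be same length")
    return _distance(a, b)


short, long_ = [1.0, 2.0], [1.0, 2.0, 3.0]

bad = distance_or_false(short, long_)
print(bad == 0)    # True: False compares equal to 0
print(bad < 0.1)   # True: the mismatch slips through a similarity threshold

try:
    distance_or_raise(short, long_)
except ValueError as exc:
    print("rejected: %s" % exc)
```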

By using generator expressions inside of calls to sum() you can calculate your values with most of the work being done by the C code inside of Python. This should be faster than using a for loop.

I haven't timed this so I can't guess how much faster it might be. But the SciPy code is almost certainly written in C or C++ and it should be about as fast as you can get.

If you are doing bioinformatics in Python, you really should be using SciPy anyway.

EDIT: Darius Bacon timed my code and found it slower. So I timed my code and... yes, it is slower. The lesson for all: when you are trying to speed things up, don't guess, measure.

I am baffled as to why my attempt to put more work on the C internals of Python is slower. I tried it for lists of length 1000 and it was still slower.
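One way to run that kind of measurement with the standard library's timeit module, as a rough sketch. The function names cosine_loop and cosine_gen, the test data, and the repeat counts are arbitrary choices for this example. Timings depend on the machine and the Python version, so no results are shown here.

```python
import timeit
from math import sqrt


def cosine_loop(a, b):
    """Explicit for loop, like the question's updated version."""
    numerator = denoma = denomb = 0
    for i in range(len(a)):
        ai = a[i]
        bi = b[i]
        numerator += ai * bi
        denoma += ai * ai
        denomb += bi * bi
    return 1 - numerator / (sqrt(denoma) * sqrt(denomb))


def cosine_gen(a, b):
    """Generator expressions inside sum(), like the answer's version."""
    numerator = sum(x * y for x, y in zip(a, b))
    denoma = sum(x * x for x in a)
    denomb = sum(y * y for y in b)
    return 1 - numerator / (sqrt(denoma) * sqrt(denomb))


# Two lists of length 1000, the size mentioned above.
a = [float(i % 7 + 1) for i in range(1000)]
b = [float(i % 5 + 1) for i in range(1000)]

for func in (cosine_loop, cosine_gen):
    # repeat() returns one total per run; the minimum is the least noisy figure.
    best = min(timeit.repeat(lambda: func(a, b), number=200, repeat=5))
    print("%-12s %.4f s per 200 calls" % (func.__name__, best))
```

Rerunning with much shorter and much longer lists shows whether the ranking changes with input size.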

I can't spend any more time on trying to hack the Python cleverly. If you need more speed, I suggest you try SciPy.

EDIT: I just tested by hand, without timeit. I find that for short a and b, the old code is faster; for long a and b, the new code is faster; in both cases the difference is not large. (I'm now wondering if I can trust timeit on my Windows computer; I want to try this test again on Linux.) I wouldn't change working code to try to get it faster. And one more time I urge you to try SciPy. :-)
