Suggestions on how to speed up a distance calculation

Question

Consider the following class:

from numpy import var  # assumption: the question never says where var comes from

class SquareErrorDistance(object):
    def __init__(self, dataSample):
        variance = var(list(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, u, v): # u and v are floats
        return (u - v) ** 2 * self._norm

I use it to calculate the distance between two elements of a vector. I basically create one instance of that class for every dimension of the vector that uses this distance measure (there are dimensions that use other distance measures). Profiling reveals that the __call__ function of this class accounts for 90% of the running-time of my knn-implementation (who would have thought). I do not think there is any pure-Python way to speed this up, but maybe if I implement it in C?
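As a rough illustration of the setup described above, the following sketch builds one distance callable per dimension and combines them. `statistics.pvariance` stands in for `var` (an assumption; it computes the population variance, which is also what `numpy.var` does by default), and `absolute_distance`, `vector_distance`, and the summing of per-dimension distances are hypothetical, since the question does not show that part of the knn code.

```python
from statistics import pvariance  # assumption: stands in for the question's var()


class SquareErrorDistance:
    def __init__(self, data_sample):
        variance = pvariance(list(data_sample))
        self._norm = 1.0 if variance == 0 else 1.0 / (2 * variance)

    def __call__(self, u, v):  # u and v are floats
        return (u - v) ** 2 * self._norm


def absolute_distance(u, v):
    # Hypothetical second measure for dimensions that do not use squared error.
    return abs(u - v)


def vector_distance(measures, a, b):
    # One callable per dimension; summing the per-dimension results is an
    # assumption about how the knn code combines them.
    return sum(measure(u, v) for measure, u, v in zip(measures, a, b))


samples = [(1.0, 10.0), (3.0, 20.0), (5.0, 30.0)]
measures = [
    SquareErrorDistance(row[0] for row in samples),  # dimension 0
    absolute_distance,                               # dimension 1
]
print(vector_distance(measures, samples[0], samples[1]))
```

Every pair of vectors costs one Python-level call per dimension, which is consistent with `__call__` dominating the profile.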

If I run a simple C program that just calculates distances for random values using the formula above, it is orders of magnitude faster than Python. So I tried using ctypes to call a C function that does the computation, but apparently the conversion of the parameters and return values is far too expensive, because the resulting code is much slower.

I could of course implement the entire knn in C and just call that, but the problem is that, as I described, I use different distance functions for some dimensions of the vectors, and translating these to C would be too much work.

So what are my alternatives? Will writing the C-function using the Python C-API get rid of the overhead? Are there any other ways to speed this calculation up?

Recommended answer

Take the following Cython code (I realize the first line of __init__ is different; I replaced it with random stuff because I don't know var and because it doesn't matter anyway, since you stated that __call__ is the bottleneck):

cdef class SquareErrorDistance:
    cdef double _norm

    def __init__(self, dataSample):
        variance = round(sum(dataSample)/len(dataSample))
        if variance == 0:
            self._norm = 1.0
        else:
            self._norm = 1.0 / (2 * variance)

    def __call__(self, double u, double v): # u and v are floats
        return (u - v) ** 2 * self._norm
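
To try the class above, it has to be compiled first. A minimal build script, assuming the code is saved as `square_error_distance.pyx` (a hypothetical file name, not one given in the answer), might look like this:

```python
# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    # Hypothetical file name; point this at wherever the cdef class is saved.
    ext_modules=cythonize("square_error_distance.pyx"),
)
```

Running `python setup.py build_ext --inplace` next to the `.pyx` file should then produce an extension module that can be imported like any other Python module.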

Compiled via a simple setup.py (just the example from the docs with the file name altered), it performs nearly 20 times better than the equivalent pure Python in a simple contrived timeit benchmark. Note that the only changes were cdefs for the _norm field and the __call__ parameters. I consider this pretty impressive.
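
The benchmark itself is not shown in the answer. A sketch of what such a timeit comparison could look like follows; `PySquareErrorDistance`, `time_calls`, and the module name `square_error_distance` are all hypothetical, and the numbers will vary by machine.

```python
import timeit


class PySquareErrorDistance:
    """Pure-Python baseline with the same arithmetic as the question's class."""

    def __init__(self, norm):
        self._norm = norm

    def __call__(self, u, v):
        return (u - v) ** 2 * self._norm


def time_calls(dist, number=200_000):
    # Total seconds for `number` calls with fixed float arguments.
    # The lambda adds the same small overhead to both measurements.
    return timeit.timeit(lambda: dist(1.5, 4.0), number=number)


py_seconds = time_calls(PySquareErrorDistance(0.5))
print(f"pure Python: {py_seconds:.4f} s")

try:
    # Hypothetical module name for the compiled Cython class.
    from square_error_distance import SquareErrorDistance
except ImportError:
    print("compiled module not found; skipping the Cython timing")
else:
    # With the answer's __init__, this sample yields _norm == 0.5 as well.
    cy_seconds = time_calls(SquareErrorDistance([1.0, 1.0]))
    print(f"Cython: {cy_seconds:.4f} s ({py_seconds / cy_seconds:.1f}x faster)")
```

Because the timing loop is still driven from Python, this measures per-call overhead rather than raw arithmetic speed, which is the relevant quantity for a knn loop that stays in Python.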
