混合数值和分类数据的观测值之间成对距离计算的有效实现 [英] Efficient implementation of pairwise distances computation between observations for mixed numeric and categorical data

查看:59
本文介绍了混合数值和分类数据的观测值之间成对距离计算的有效实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在做一个数据科学项目,其中我必须计算数据集中每对观测值之间的欧几里得距离.

I am working on a data science project in which I have to compute the euclidian distance between every pair of observations in a dataset.

由于我要处理非常大的数据集,因此必须使用高效的成对距离计算(在内存使用和计算时间方面).

Since I am working with very large datasets, I have to use an efficient implementation of pairwise distances computation (both in terms of memory usage and computation time).

一种解决方案是使用Scipy中的 pdist 函数,该函数以一维数组的形式返回结果,而没有重复的实例.

One solution is to use the pdist function from Scipy, which returns the result in a 1D array, without duplicate instances.

但是,此函数无法处理分类变量.对于这些,我想在值相同的情况下将距离设置为0,在其他情况下将距离设置为1.

However, this function is not able to deal with categorical variables. For these, I want to set the distance to 0 when the values are the same and 1 otherwise.

我尝试用Numba在Python中实现此变体.该函数将包含所有观测值的2D Numpy数组和包含变量类型( float64 category )的一维数组作为输入.

I have tried to implement this variant in Python with Numba. The function takes as input the 2D Numpy array containing all the observations and a 1D array containing the types of the variables (either float64 or category).

这是代码:

import numpy as np
from numba.decorators import autojit

def pairwise(X, types):
    m = X.shape[0]
    n = X.shape[1]

    D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float)
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0

            for k in range(n):
                if types[k] == 'float64':
                    tmp = X[i, k] - X[j, k]
                    d += tmp * tmp
                else:
                    if X[i, k] != X[j, k]:
                        d += 1.

            D[ind] = np.sqrt(d)
            ind += 1

    return D.reshape(1, -1)[0]

pairwise_numba = autojit(pairwise)

vectors = np.random.rand(20000, 100)
types = np.array(['float64']*100)

dists = pairwise_numba(vectors, types)

尽管使用了Numba,该实现仍然非常缓慢.是否可以改善我的代码以使其更快?

This implementation is very slow despite the use of Numba. Is it possible to improve my code to make it faster ?

推荐答案

如果您真的希望numba快速执行,则需要在 nopython 模式下 jit 该功能,否则numba可能会退回到较慢的对象模式(并且可能非常慢).

In case you really want numba to perform fast you need to jit the function in nopython mode, otherwise numba may fall back to object mode which is slower (and can be quite slow).

但是在nopython模式下无法编译函数(从numba版本0.43.1开始),这是因为:

However your function cannot be compiled (as of numba version 0.43.1) in nopython mode, that's because:

  • np.empty dtype 参数. np.float 只是Pythons float ,将由NumPy(但不是numba)翻译为 np.float _ .如果您使用numba,则必须使用它.
  • 缺少numba中的字符串支持.因此 types [k] =='float64'行将无法编译.
  • the dtype argument to np.empty. np.float is simply Pythons float and will be translated by NumPy (but not numba) to np.float_. If you use numba you have to use that.
  • String support in numba is lacking. So the types[k] == 'float64' line will not compile.

第一个问题很简单.关于第二个问题:提供一个布尔数组,而不是尝试使字符串比较起作用.与比较最多7个字符相比,使用布尔数组并评估一个布尔值是否也将显着更快.尤其是在最里面的循环中!

The first issue is trivially fixe. Regarding the second issue: instead of trying to make the string comparisons work just provide a boolean array. Using a boolean array and evaluating one boolean for thruthiness will also be significantly faster than comparing up to 7 characters. Especially if it's in the innermost loop!

所以它可能看起来像这样:

So it might look like this:

import numpy as np
import numba as nb

@nb.njit
def pairwise_numba(X, is_float_type):
    m = X.shape[0]
    n = X.shape[1]

    D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float64)  # corrected dtype
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0

            for k in range(n):
                if is_float_type[k]:
                    tmp = X[i, k] - X[j, k]
                    d += tmp * tmp
                else:
                    if X[i, k] != X[j, k]:
                        d += 1.

            D[ind] = np.sqrt(d)
            ind += 1

    return D.reshape(1, -1)[0]

dists = pairwise_numba(vectors, types == 'float64')  # pass in the boolean array

但是,如果将浮点类型上的 scipy.spatial.distances.pdist 与numba逻辑相结合以计数不相等的类别,则可以简化逻辑:

However you can simplify the logic if you combine scipy.spatial.distances.pdist on the float types with a numba logic to count the unequal categorials:

from scipy.spatial.distance import pdist

@nb.njit
def categorial_sum(X):
    m = X.shape[0]
    n = X.shape[1]
    D = np.zeros(int(m * (m - 1) / 2), dtype=np.float64)  # corrected dtype
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0
            for k in range(n):
                if X[i, k] != X[j, k]:
                    d += 1.
            D[ind] = d
            ind += 1

    return D

def pdist_with_categorial(vectors, types):
    where_float_type = types == 'float64'
    # calculate the squared distance of the float values
    distances_squared = pdist(vectors[:, where_float_type], metric='sqeuclidean')
    # sum the number of mismatched categorials and add that to the distances 
    # and then take the square root
    return np.sqrt(distances_squared + categorial_sum(vectors[:, ~where_float_type]))

它不会显着更快,但是它大大简化了numba函数中的逻辑.

It won't be significantly faster, but it drastically simplified the logic in the numba function.

然后,您还可以通过将平方距离传递给numba函数来避免创建其他数组:

Then you can also avoid the additional array creations by passing in the squared distances to the numba function:

@nb.njit
def add_categorial_sum_and_sqrt(X, D):
    m = X.shape[0]
    n = X.shape[1]
    ind = 0
    for i in range(m):
        for j in range(i+1, m):
            d = 0.0
            for k in range(n):
                if X[i, k] != X[j, k]:
                    d += 1.
            D[ind] = np.sqrt(D[ind] + d)
            ind += 1

    return D

def pdist_with_categorial(vectors, types):
    where_float_type = types == 'float64'
    distances_squared = pdist(vectors[:, where_float_type], metric='sqeuclidean')
    return add_categorial_sum_and_sqrt(vectors[:, ~where_float_type], distances_squared)

这篇关于混合数值和分类数据的观测值之间成对距离计算的有效实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆