Calculate a distance matrix with mixed categorical and numeric data

Question

I have a data frame with a mixture of numeric (15 fields) and categorical (5 fields) data.

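I can build a complete distance matrix for the numeric fields by following the post on creating a distance matrix with a custom calculation in pandas.

I would also like to include the categorical fields.

Using this as a template: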

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df2 = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})
# Numeric columns only: u - v is not defined for the string column.
num = df2[['col1', 'col2']].to_numpy(dtype=float)
# w is not defined in the original post; assumed here to be per-column weights of 1.
w = np.ones(num.shape[1])
pd.DataFrame(squareform(pdist(num, lambda u, v: np.sqrt((w*(u-v)**2).sum()))), index=df2.index, columns=df2.index)

In the squareform calculation, I would like to include the test np.where(u[2]==v[2], 0, 10) (and the same test for the other categorical columns).

How do I modify the lambda function to carry out this test as well?

Here, the distance between rows 0 and 1 is

= sqrt((2-1)^2 + (6-5)^2 + (cat - cat)^2)
= sqrt(1 + 1 + 0)

and the distance between rows 0 and 2 is

= sqrt((3-1)^2 + (7-5)^2 + (dog - cat)^2)
= sqrt(4 + 4 + 100)

Can anyone suggest how I can implement this algorithm?

Recommended answer

import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform

df2 = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})

def fun(u,v):
    const = 0 if u[2] == v[2] else 10
    return np.sqrt((u[0]-v[0])**2 + (u[1]-v[1])**2 + const**2)

pd.DataFrame(squareform(pdist(df2.values, fun)), index=df2.index, columns=df2.index)

Result:

           0          1          2          3
0   0.000000   1.414214  10.392305  10.862780
1   1.414214   0.000000  10.099505  10.392305
2  10.392305  10.099505   0.000000  10.099505
3  10.862780  10.392305  10.099505   0.000000
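
The question also mentions several categorical columns, which the function above hard-codes as position 2. One possible generalization, offered as a sketch rather than part of the original answer, computes the squared numeric distances with pdist's built-in "sqeuclidean" metric and then adds a squared penalty for every categorical column where two rows differ. The helper name mixed_distance_matrix and its parameters num_cols, cat_cols and penalty are hypothetical names chosen for illustration; the penalty of 10 comes from the question.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def mixed_distance_matrix(df, num_cols, cat_cols, penalty=10.0):
    """Euclidean distance on num_cols plus a fixed penalty per mismatching cat_cols entry.

    Hypothetical helper: the name and signature are illustrative assumptions.
    """
    num = df[num_cols].to_numpy(dtype=float)
    # Squared Euclidean distances over the numeric columns, as a square matrix.
    sq = squareform(pdist(num, metric="sqeuclidean"))
    for col in cat_cols:
        vals = df[col].to_numpy(dtype=object)
        # Boolean matrix: True where rows i and j disagree on this column.
        mismatch = vals[:, None] != vals[None, :]
        # Each mismatch contributes penalty**2 under the square root.
        sq = sq + mismatch * penalty ** 2
    return pd.DataFrame(np.sqrt(sq), index=df.index, columns=df.index)


df2 = pd.DataFrame({'col1': [1, 2, 3, 4],
                    'col2': [5, 6, 7, 8],
                    'col3': ['cat', 'cat', 'dog', 'bird']})
dist = mixed_distance_matrix(df2, ['col1', 'col2'], ['col3'])
print(dist)
```

For the 15 numeric and 5 categorical fields in the question, the two column lists would simply name those fields. Because this version avoids calling a Python function for every pair of rows, it may also run faster on larger frames, although that has not been measured here.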
