sklearn's pairwise distance result is unexpectedly asymmetrical


Problem description

I am calculating the pairwise Euclidean distances between the elements of a vector, using the pairwise_distances function from the sklearn package. However, the resulting matrix is only approximately symmetric for some elements: in one example, values that are supposed to be equal agree only up to about 15 digits after the decimal point.

I noticed this because I was getting errors in a downstream analysis that assumes its input matrices are symmetric. I know I can round the values, but what is causing this?

Here is the vector I am trying to calculate the pairwise distances for (it is a column of a pandas DataFrame):

lag_measure_data[['bios_level']].values

array([[ 0.76881030949999995538490793478558771312236785888671875 ],
   [ 0.                                                      ],
   [ 0.67783090619999997183953155399649403989315032958984375 ],
   [ 0.3228176074999999922710003374959342181682586669921875  ],
   [ 0.75822395549999999087020796650904230773448944091796875 ],
   [ 0.469808621599999975959605080788605846464633941650390625],
   [ 0.989529862699999984698706612107343971729278564453125   ],
   [ 0.                                                      ],
   [ 0.5575436799999999859522858969285152852535247802734375  ],
   [ 0.9756440299999999954394525047973729670047760009765625  ],
   [ 0.66511863289999995085821637985645793378353118896484375 ],
   [ 0.978062709200000046649847718072123825550079345703125   ],
   [ 0.473957179800000016900440868994337506592273712158203125],
   [ 0.82409385540000001935112550199846737086772918701171875 ],
   [ 0.56548685279999999497846374651999212801456451416015625 ],
   [ 0.399505730399999980928527065771049819886684417724609375],
   [ 0.474232963900000026313819034839980304241180419921875   ],
   [ 0.34276307189999999369689476225175894796848297119140625 ],
   [ 0.9985316859999999739017084721126593649387359619140625  ],
   [ 0.9063241512999999915933813099400140345096588134765625  ],
   [ 0.                                                      ]])

Here is the command I use to get the distance matrix:

d_matrix_lag = pairwise_distances(lag_measure_data[['bios_level']].values)

I am not printing the full distance matrix here because it is too messy, but as an example, the value in the first row, fourth column is

0.445992701999999907602756366031826473772525787353515625

while the value in the fourth row, first column is

0.4459927019999998520916051347739994525909423828125
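
For scale, the two values quoted above can be compared directly. The snippet below is a minimal sketch that assumes only NumPy; the names `upper`, `lower`, `gap`, and `adjacent` are illustrative and not part of the original code. It prints the difference between the two entries and checks whether they are neighbouring double-precision numbers, which would mean the mismatch is a rounding difference in the last bit rather than a substantive error.

```python
import numpy as np

# The two values reported in the question for entries [0, 3] and [3, 0].
upper = 0.445992701999999907602756366031826473772525787353515625
lower = 0.4459927019999998520916051347739994525909423828125

# Absolute difference between the two supposedly equal entries.
gap = upper - lower

# np.nextafter(x, y) returns the next representable float after x towards y,
# so this is True when the two values are adjacent doubles.
adjacent = np.nextafter(lower, 1.0) == upper

print(gap)
print(adjacent)
```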

Recommended answer

I could reproduce your problem by testing for symmetry:

import numpy as np

a = np.array([[ 0.76881030949999995538490793478558771312236785888671875 ],
   [ 0.                                                      ],
   [ 0.67783090619999997183953155399649403989315032958984375 ],
   [ 0.3228176074999999922710003374959342181682586669921875  ],
   [ 0.75822395549999999087020796650904230773448944091796875 ],
   [ 0.469808621599999975959605080788605846464633941650390625],
   [ 0.989529862699999984698706612107343971729278564453125   ],
   [ 0.                                                      ],
   [ 0.5575436799999999859522858969285152852535247802734375  ],
   [ 0.9756440299999999954394525047973729670047760009765625  ],
   [ 0.66511863289999995085821637985645793378353118896484375 ],
   [ 0.978062709200000046649847718072123825550079345703125   ],
   [ 0.473957179800000016900440868994337506592273712158203125],
   [ 0.82409385540000001935112550199846737086772918701171875 ],
   [ 0.56548685279999999497846374651999212801456451416015625 ],
   [ 0.399505730399999980928527065771049819886684417724609375],
   [ 0.474232963900000026313819034839980304241180419921875   ],
   [ 0.34276307189999999369689476225175894796848297119140625 ],
   [ 0.9985316859999999739017084721126593649387359619140625  ],
   [ 0.9063241512999999915933813099400140345096588134765625  ],
   [ 0.                                                      ]])

from sklearn.metrics.pairwise import pairwise_distances
dist_sklearn = pairwise_distances(a)
print((dist_sklearn.transpose() == dist_sklearn).all())

This prints False. Try using scipy.spatial.distance instead. pdist returns the pairwise distances as a condensed distance vector, which you can convert to a square distance matrix with squareform():

from scipy.spatial.distance import pdist, squareform

dist = pdist(a)
sq = squareform(dist)
print((sq.transpose() == sq).all())

This gave me a symmetric matrix. Hope this helps.
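
As for the cause: as far as I know, the scikit-learn documentation for euclidean_distances notes that it computes distances through a dot-product expansion for speed, and that the returned matrix may therefore not be exactly symmetric. squareform, by contrast, fills both triangles from the same condensed vector, so its result is symmetric by construction. If you need to keep the sklearn output, one option is to test symmetry with a tolerance and then symmetrize explicitly. The following is only a sketch: the helpers `is_symmetric` and `symmetrize` and the tolerance `atol=1e-12` are assumptions made for illustration, not part of either library.

```python
import numpy as np

def is_symmetric(d, atol=1e-12):
    """Return True if d is square and equals its transpose within atol."""
    d = np.asarray(d, dtype=float)
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        return False
    return bool(np.allclose(d, d.T, rtol=0.0, atol=atol))

def symmetrize(d):
    """Average d with its transpose so that mirrored entries match exactly."""
    d = np.asarray(d, dtype=float)
    return (d + d.T) / 2.0

# Toy 2x2 matrix using the two mismatched values from the question.
d = np.array([
    [0.0, 0.445992701999999907602756366031826473772525787353515625],
    [0.4459927019999998520916051347739994525909423828125, 0.0],
])

print((d == d.T).all())          # exact comparison, as in the test above
print(is_symmetric(d))           # comparison with a tolerance
d_sym = symmetrize(d)
print((d_sym == d_sym.T).all())  # exact comparison after symmetrizing
```

Averaging with the transpose changes each entry by at most half of the original mismatch, so it should be harmless when the asymmetry is at the level of floating-point rounding.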
