Pairwise Euclidean distance with pandas ignoring NaNs

Question

I start with a dictionary, which is how my data was already formatted:

import pandas as pd
dict2 = {
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
}

I then convert it to a pandas DataFrame:

df = pd.DataFrame(dict2)
print(df)
     A    B    C
a  1.0  2.0  NaN
b  2.0  NaN  1.0
c  NaN  2.0  2.0
d  4.0  5.0  4.0

Of course, I can get the difference for one pair of columns at a time by doing this:

df['A'] - df['B']
Out[643]: 
a   -1.0
b    NaN
c    NaN
d   -1.0
dtype: float64

I figured out how to loop through the columns and calculate A-A, A-B, and A-C:

for column in df:
    print(df['A'] - df[column])

a    0.0
b    0.0
c    NaN
d    0.0
Name: A, dtype: float64
a   -1.0
b    NaN
c    NaN
d   -1.0
dtype: float64
a    NaN
b    1.0
c    NaN
d    0.0
dtype: float64

What I would like to do is iterate through the columns to calculate |A-B|, |A-C|, and |B-C|, and store the results in another dictionary.
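One way to build such a dictionary directly is sketched below. It is not part of the original answer: it assumes `itertools.combinations` is an acceptable way to enumerate the column pairs, and the name `abs_diffs` is made up for illustration.

```python
from itertools import combinations

import pandas as pd

dict2 = {
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
}
df = pd.DataFrame(dict2)

# Map each unordered pair of column labels, e.g. ('A', 'B'),
# to the Series of absolute differences |A - B|.
# Rows where either column is NaN stay NaN in the result.
abs_diffs = {
    (left, right): (df[left] - df[right]).abs()
    for left, right in combinations(df.columns, 2)
}

print(abs_diffs[('A', 'B')])
```

The keys are tuples of column labels, so `abs_diffs[('A', 'C')]` and `abs_diffs[('B', 'C')]` hold the other two pairs.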

I want to do this so that I can later calculate the Euclidean distance between all combinations of columns. If there is an easier way to do this, I would like to see it as well. Thank you.

Recommended answer

You can use NumPy broadcasting to compute a vectorised Euclidean distance (L2 norm) between every pair of columns, using np.nansum to skip NaNs.

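import numpy as np

i = df.values.T  # shape (n_columns, n_rows): one row per original column
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5  # NaN terms are treated as 0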
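To make the broadcasting step easier to follow, here is a self-contained sketch that spells out the intermediate shapes and cross-checks one entry against plain pandas. The names `cols`, `diff`, `dist`, and `check` are illustrative and not from the answer.

```python
import numpy as np
import pandas as pd

dict2 = {
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
}
df = pd.DataFrame(dict2)

cols = df.to_numpy().T       # shape (3, 4): one row per original column
diff = cols - cols[:, None]  # shape (3, 3, 4): diff[p, q] = cols[q] - cols[p]

# Sum the squared differences over the row axis; np.nansum treats NaN as 0,
# so only rows present in both columns contribute to each distance.
dist = np.sqrt(np.nansum(diff ** 2, axis=2))  # shape (3, 3)

# Cross-check the A-B entry: Series.sum() skips NaN by default.
check = np.sqrt(((df['A'] - df['B']) ** 2).sum())
print(dist.shape, dist[0, 1], check)
```

Because NaN terms are dropped, each pair is measured only over the rows both columns share. Pairs with different amounts of overlap are therefore not strictly comparable, and a pair with no shared rows would get a distance of 0.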

If you want a DataFrame representing the distance matrix, here is what that would look like:

df = pd.DataFrame(j, index=df.columns, columns=df.columns)
df
          A         B    C
A  0.000000  1.414214  1.0
B  1.414214  0.000000  1.0
C  1.000000  1.000000  0.0

Here df.loc[i, j] is the distance between columns i and j of the original DataFrame (for example, df.loc['A', 'B']).
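Since the question asked for the results in a dictionary, the distance matrix can also be flattened into one entry per pair of columns. This is a sketch only: the names `dist_df` and `pair_dist` are hypothetical, and the matrix is rebuilt here so the snippet runs on its own.

```python
from itertools import combinations

import numpy as np
import pandas as pd

dict2 = {
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
}
df = pd.DataFrame(dict2)

# Rebuild the NaN-ignoring distance matrix, labelled by column name.
cols = df.to_numpy().T
dist = np.sqrt(np.nansum((cols - cols[:, None]) ** 2, axis=2))
dist_df = pd.DataFrame(dist, index=df.columns, columns=df.columns)

# One entry per unordered pair of column labels, looked up by label.
pair_dist = {
    (a, b): float(dist_df.loc[a, b])
    for a, b in combinations(dist_df.columns, 2)
}
print(pair_dist)
```

The matrix is symmetric with a zero diagonal, so the three unordered pairs carry all the information in it.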
