Pairwise Euclidean distance with pandas ignoring NaNs
Question
I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},
'C':{'b':1.0,'c':2.0, 'd':4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
     A    B    C
a  1.0  2.0  NaN
b  2.0  NaN  1.0
c  NaN  2.0  2.0
d  4.0  5.0  4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.
Recommended answer
You can use numpy broadcasting to compute the vectorised Euclidean distance (L2 norm) between every pair of columns, ignoring NaNs with np.nansum.
import numpy as np
i = df.values.T  # one row per original column
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5
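To see why the broadcast works, it can help to spell out the intermediate shapes. The sketch below rebuilds the example frame from the question and compares the broadcast result with an explicit double loop; the names frame, arr, diff, dist and expected are illustrative only and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Same data as in the question
data = {'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
        'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
        'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}}
frame = pd.DataFrame(data)

arr = frame.values.T                        # shape (3, 4): one row per original column
diff = arr - arr[:, None]                   # shape (3, 3, 4): all pairwise row differences
dist = np.nansum(diff ** 2, axis=2) ** 0.5  # shape (3, 3): NaN terms are skipped

# The same computation written as explicit loops, for comparison
n = arr.shape[0]
expected = np.zeros((n, n))
for p in range(n):
    for q in range(n):
        expected[p, q] = np.sqrt(np.nansum((arr[p] - arr[q]) ** 2))

print(np.allclose(dist, expected))
```

One thing to keep in mind: np.nansum treats NaN as zero, so two columns that share no row labels at all would get a distance of 0 rather than NaN.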
If you want a DataFrame representing a distance matrix, here's what that would look like:
df = pd.DataFrame(j, index=df.columns, columns=df.columns)
df
          A         B    C
A  0.000000  1.414214  1.0
B  1.414214  0.000000  1.0
C  1.000000  1.000000  0.0
Here df.iloc[i, j] is the distance between the i-th and j-th columns of the original DataFrame.
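If you also want the per-pair absolute differences stored in a dictionary, as the question asks, a plain loop over column pairs is one option. This is a minimal sketch using itertools.combinations; the names frame, abs_diffs and distances are hypothetical, and it is slower than the broadcast version on wide frames.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Same data as in the question
data = {'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
        'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
        'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}}
frame = pd.DataFrame(data)

abs_diffs = {}   # (col1, col2) -> Series of |col1 - col2|
distances = {}   # (col1, col2) -> Euclidean distance over shared rows
for left, right in combinations(frame.columns, 2):
    delta = (frame[left] - frame[right]).abs()  # NaN where either value is missing
    abs_diffs[(left, right)] = delta
    # Series.sum skips NaN by default (skipna=True)
    distances[(left, right)] = float(np.sqrt((delta ** 2).sum()))

print(distances)
```

The keys are tuples such as ('A', 'B'), so each unordered pair of columns appears exactly once.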