Calculate intersection over union (Jaccard's index) in pandas dataframe


Problem description

I have a dataframe like:

animal    ids
cat       1,3,4
dog       1,2,4
hamster   5        
dolphin   3,5

The dataframe is quite big, with over 80 thousand rows, and the ids column can easily contain thousands, even tens of thousands, of comma-separated ids. The ids in a given row are unique within the comma-separated string.

I would like to construct a dataframe that holds the Jaccard index for every pair of items in the animal column, i.e. the size of the intersection of their ids divided by the size of the union of their ids.

So if we look at cat and dog, the intersection is 2 (ids 1 and 4) and the union is 4 (ids 1, 2, 3, 4), hence the Jaccard index is 2/4 = 0.5. It would be great to have the result in this format:

            cat        dog        hamster    dolphin
cat         1          0.5        0          0.25
dog         0.5        1          0          0
hamster     0          0          1          0.5
dolphin     0.25       0          0.5        1

That is, the animal names serve as both the row index and the column labels, so that I can quickly look up a Jaccard index like this:

cat_dog_ji = df_new['cat']['dog']
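
For reference, the arithmetic in the cat/dog example can be checked with plain Python sets. This is only a minimal sketch: `jaccard_index` is a hypothetical helper name, and returning 1.0 for two empty sets is an assumption made here, not something stated in the question.

```python
# Hypothetical helper for illustration; not part of pandas or scipy.
def jaccard_index(a: set, b: set) -> float:
    """Return |a & b| / |a | b|; two empty sets are treated as identical (assumption)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# The ids from the sample dataframe, split on commas.
cat = set("1,3,4".split(","))
dog = set("1,2,4".split(","))
dolphin = set("3,5".split(","))

print(jaccard_index(cat, dog))      # 0.5  (2 shared ids out of 4 distinct ids)
print(jaccard_index(cat, dolphin))  # 0.25 (1 shared id out of 4 distinct ids)
```

Looping over all pairs this way would be slow for 80 thousand rows, which is why a vectorized approach is wanted.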

Solution

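You can use str.get_dummies and some scipy tools here. Note that pdist with the "jaccard" metric returns the Jaccard distance, so it is subtracted from 1 to get the index.


import pandas as pd
from scipy.spatial import distance

# One 0/1 indicator column per distinct id
u = df["ids"].str.get_dummies(",")
# Condensed vector of pairwise Jaccard distances between rows
j = distance.pdist(u, "jaccard")
k = df["animal"].to_numpy()
# Jaccard index = 1 - Jaccard distance, expanded to a square matrix
pd.DataFrame(1 - distance.squareform(j), index=k, columns=k)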


          cat  dog  hamster  dolphin
cat      1.00  0.5      0.0     0.25
dog      0.50  1.0      0.0     0.00
hamster  0.00  0.0      1.0     0.50
dolphin  0.25  0.0      0.5     1.00
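
With the sizes mentioned in the question, the dense indicator matrix from str.get_dummies may become very large. The following is a sketch of one possible alternative, not the answer's method: it builds the same indicator matrix as a scipy sparse matrix and derives intersections and unions from a matrix product. It assumes every row has at least one id and that ids are unique within a row (as the question states); the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "animal": ["cat", "dog", "hamster", "dolphin"],
    "ids": ["1,3,4", "1,2,4", "5", "3,5"],
})

# Split each string into a list of ids and record which row each id came from.
id_lists = df["ids"].str.split(",")
lengths = id_lists.str.len().to_numpy()
rows = np.repeat(np.arange(len(df)), lengths)

# Map every distinct id to a column number.
cols, _ = pd.factorize(id_lists.explode().str.strip())

# Sparse 0/1 indicator matrix: one row per animal, one column per distinct id.
m = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.int32), (rows, cols)),
    shape=(len(df), int(cols.max()) + 1),
)

# (m @ m.T)[i, j] counts the ids shared by rows i and j.
inter = (m @ m.T).toarray()
sizes = inter.diagonal()  # number of ids in each row
union = sizes[:, None] + sizes[None, :] - inter

k = df["animal"].to_numpy()
df_new = pd.DataFrame(inter / union, index=k, columns=k)
print(df_new)
```

The result is still a dense n x n matrix, so for 80 thousand rows it would need roughly 51 GB as float64; in that case the product would have to be computed and stored block by block.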
