How to calculate the similarity for all the rows in a table in R?

Problem description

I would like to calculate the similarity (a numerical measure of how alike two data objects are; in this case, how alike two rows are) between the rows of a table. The table looks like this:

vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc

I tried many different approaches found on the internet, but most of them calculate similarity for a matrix. We can easily tell that the first and second rows are "most similar" because they differ in only one variable, but I need a way to compare every row of this table with every other row in one go.

The outcome might look like this: the similarity of the first and second rows is 0.983.

Recommended answer

This essentially calculates the proportion of elements that are the same between two rows. First, I create the data frame:

# Create data frame
# Note: rows 3-9 are indented inside the string, and read.table() keeps that
# leading whitespace in the first column unless strip.white = TRUE is set.
data <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
           vhigh,vhigh,2,2,small,high,unacc
           vhigh,vhigh,2,2,med,low,unacc
           vhigh,vhigh,2,2,med,med,unacc
           vhigh,vhigh,2,2,med,high,unacc
           vhigh,vhigh,2,2,big,low,unacc
           vhigh,vhigh,2,2,big,med,unacc
           vhigh,vhigh,2,2,big,high,unacc", sep = ",")

Next, I load dplyr.

# Load dplyr library
library(dplyr)

This is the function that does all the work.

# Function for comparing rows
row_cf <- function(x, y, df){
  sum(df[x,] == df[y,])/ncol(df)
}
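
To make the measure concrete, here is a minimal sketch of the same calculation done by hand on two rows. The vectors `row_a` and `row_b` are hypothetical names used only for illustration; they hold the first two rows of the question's table as plain character vectors.

```r
# Two rows written out by hand (the first two rows of the question's table)
row_a <- c("vhigh", "vhigh", "2", "2", "small", "low", "unacc")
row_b <- c("vhigh", "vhigh", "2", "2", "small", "med", "unacc")

# Element-wise comparison gives a logical vector; TRUE counts as 1 in sum()
matches <- row_a == row_b

# Proportion of positions that agree: 6 of the 7 fields match
similarity <- sum(matches) / length(row_a)
print(similarity)
```

Only the sixth field differs, so the result is 6/7, which is the 0.8571429 reported for rows 1 and 2 in the answer's output.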

Here it is applied.

# 1) Create all possible row combinations
# 2) Rename the columns for readability
# 3) Run through each row
# 4) Calculate similarity
res <- expand.grid(1:nrow(data), 1:nrow(data)) %>% 
  rename(row_1 = Var1, row_2 = Var2) %>% 
  rowwise() %>% 
  mutate(similarity = row_cf(row_1, row_2, data))

# Results
#    row_1 row_2 similarity
# 1      1     1  1.0000000
# 2      2     1  0.8571429
# 3      3     1  0.7142857
# 4      4     1  0.7142857
# 5      5     1  0.5714286
# 6      6     1  0.5714286
# 7      7     1  0.7142857
# 8      8     1  0.5714286
# 9      9     1  0.5714286
# 10     1     2  0.8571429
# 11     2     2  1.0000000
# 12     3     2  0.7142857
# 13     4     2  0.5714286
# 14     5     2  0.7142857
# 15     6     2  0.5714286
# 16     7     2  0.5714286
# 17     8     2  0.7142857
# 18     9     2  0.5714286
# 19     1     3  0.7142857
# 20     2     3  0.7142857
# 21     3     3  1.0000000
# 22     4     3  0.7142857
# 23     5     3  0.7142857
# 24     6     3  0.8571429
# 25     7     3  0.7142857
# 26     8     3  0.7142857
# 27     9     3  0.8571429
# 28     1     4  0.7142857
# 29     2     4  0.5714286
# 30     3     4  0.7142857
# 31     4     4  1.0000000
# 32     5     4  0.8571429
# 33     6     4  0.8571429
# 34     7     4  0.8571429
# 35     8     4  0.7142857
# 36     9     4  0.7142857
# 37     1     5  0.5714286
# 38     2     5  0.7142857
# 39     3     5  0.7142857
# 40     4     5  0.8571429
# 41     5     5  1.0000000
# 42     6     5  0.8571429
# 43     7     5  0.7142857
# 44     8     5  0.8571429
# 45     9     5  0.7142857
# 46     1     6  0.5714286
# 47     2     6  0.5714286
# 48     3     6  0.8571429
# 49     4     6  0.8571429
# 50     5     6  0.8571429
# 51     6     6  1.0000000
# 52     7     6  0.7142857
# 53     8     6  0.7142857
# 54     9     6  0.8571429
# 55     1     7  0.7142857
# 56     2     7  0.5714286
# 57     3     7  0.7142857
# 58     4     7  0.8571429
# 59     5     7  0.7142857
# 60     6     7  0.7142857
# 61     7     7  1.0000000
# 62     8     7  0.8571429
# 63     9     7  0.8571429
# 64     1     8  0.5714286
# 65     2     8  0.7142857
# 66     3     8  0.7142857
# 67     4     8  0.7142857
# 68     5     8  0.8571429
# 69     6     8  0.7142857
# 70     7     8  0.8571429
# 71     8     8  1.0000000
# 72     9     8  0.8571429
# 73     1     9  0.5714286
# 74     2     9  0.5714286
# 75     3     9  0.8571429
# 76     4     9  0.7142857
# 77     5     9  0.7142857
# 78     6     9  0.8571429
# 79     7     9  0.8571429
# 80     8     9  0.8571429
# 81     9     9  1.0000000
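
One detail in the results above is worth a second look: rows 1 and 3 differ only in the sixth field, yet they score 0.714 (5/7) instead of 6/7. This appears to come from the indented lines in the `read.table()` text, whose leading spaces end up in the first column for rows 3 to 9, so that column no longer matches rows 1 and 2. The sketch below is one possible way to avoid that, assuming the whitespace is unintended: it reads the same text with `strip.white = TRUE` and collects the pairwise values in a matrix using base R only. The names `data_clean` and `sim` are hypothetical.

```r
# Same nine rows as in the answer. strip.white = TRUE asks read.table() to drop
# leading and trailing whitespace from character fields when sep is given.
data_clean <- read.table(text = "vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
           vhigh,vhigh,2,2,small,high,unacc
           vhigh,vhigh,2,2,med,low,unacc
           vhigh,vhigh,2,2,med,med,unacc
           vhigh,vhigh,2,2,med,high,unacc
           vhigh,vhigh,2,2,big,low,unacc
           vhigh,vhigh,2,2,big,med,unacc
           vhigh,vhigh,2,2,big,high,unacc", sep = ",", strip.white = TRUE)

n <- nrow(data_clean)

# Proportion of matching fields for every pair of rows, as an n x n matrix.
# sim[i, j] is the similarity between row i and row j.
sim <- sapply(seq_len(n), function(j) {
  sapply(seq_len(n), function(i) {
    sum(data_clean[i, ] == data_clean[j, ]) / ncol(data_clean)
  })
})

print(round(sim, 3))
```

With the whitespace stripped, rows that differ in one field score 6/7 and rows that differ in two fields score 5/7, so `sim[1, 3]` is 6/7. The matrix form is symmetric with ones on the diagonal, which can be easier to scan than the 81-row long format when the table is small.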
