Compute similarity percentage OR compute correlation between more than 2 objects
Problem description
Suppose I have four objects (a, b, c, d), and I ask five people to label each of them (category 1 or 2) according to physical appearance or some other criterion. The labels provided by the five people for these objects are:
df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))
In table format (one row per person, one column per object):
---------
a b c d
---------
1 1 2 1
2 2 1 2
1 2 2 1
2 1 2 2
1 1 2 1
----------
Now I want to calculate the percentage of times a group of objects was given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 of the 5 people, so the percentage for that group is 3/5 (= 60%). Objects a and d, on the other hand, were given the same label by all five people, so the percentage for that group is 5/5 (= 100%).
I can calculate this statistic by hand, but my real dataset has 50 such objects, 30 people and 4 labels (1, 2, 3 and 4). How can I compute this statistic automatically for the bigger dataset? Are there any existing packages or tools in R that can calculate it?
Note: A group can be of any size. In the first example the group consists of a, b and d, whereas in the second example the group consists of a and d.
Recommended answer
There are two tasks here: first, building a list of all the relevant combinations of objects, and second, evaluating and aggregating row-wise similarity. combn can start the first task, but its output takes a little massaging to arrange into a neat list. The second task could be handled with prop.table, but here it is simpler to calculate the proportion directly.
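To illustrate the prop.table remark, here is a minimal sketch for a single group, assuming the df from the question. The variable names group, agree and props are illustrative only. It also hints at why the direct calculation is simpler: table() drops a category that never occurs unless both levels are forced.

```r
# Example data from the question
df <- data.frame(a = c(1,2,1,2,1), b = c(1,2,2,1,1),
                 c = c(2,1,2,2,2), d = c(1,2,1,2,1))

group <- c("a", "b", "d")  # illustrative group of objects

# TRUE for each row (person) whose labels are identical across the group
agree <- apply(df[group], 1, function(x) length(unique(x)) == 1)

# Forcing both levels keeps a zero entry when no row (or every row) agrees
props <- prop.table(table(factor(agree, levels = c(FALSE, TRUE))))
props["TRUE"]

# The direct calculation gives the same proportion
mean(agree)
```

For this group both expressions give 0.6, matching the 3/5 worked out in the question.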
The code below uses tidyverse grammar (primarily purrr, which is helpful for handling lists), but it can be converted to base R if you prefer.
library(tidyverse)

map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%    # get combinations
    flatten() %>%    # eliminate nesting
    set_names(map_chr(., paste0, collapse = '')) %>%    # add useful names
    # subset df with combination, see if each row has only one unique value
    map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>%
    map_dbl(~sum(.x) / length(.x))    # calculate TRUE proportion

##   ab   ac   ad   bc   bd   cd  abc  abd  acd  bcd abcd
##  0.6  0.2  1.0  0.2  0.6  0.2  0.0  0.6  0.2  0.0  0.0
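For readers who prefer base R, the same idea can be sketched as below. This is one possible translation under stated assumptions, not the answer's own code: the helper name agreement and the max_size cap are hypothetical. The row check compares every column of the group with the group's first column, so it works for any number of label values, including the four labels in the full dataset.

```r
# Example data from the question
df <- data.frame(a = c(1,2,1,2,1), b = c(1,2,2,1,1),
                 c = c(2,1,2,2,2), d = c(1,2,1,2,1))

# Hypothetical helper: share of rows (people) that gave every
# object in `group` the same label
agreement <- function(data, group) {
  sub <- data[, group, drop = FALSE]
  # a row agrees when all of its columns equal the group's first column
  same <- rowSums(sub == sub[[1]]) == ncol(sub)
  mean(same)
}

agreement(df, c("a", "b", "d"))  # the first group from the question
agreement(df, c("a", "d"))       # the second group from the question

# All groups of 2 up to max_size objects (max_size is an assumed cap)
max_size <- 3
groups <- unlist(
  lapply(2:max_size, function(k) combn(names(df), k, simplify = FALSE)),
  recursive = FALSE
)
names(groups) <- vapply(groups, paste0, character(1), collapse = "")
result <- vapply(groups, agreement, numeric(1), data = df)
result
```

The size cap matters for the full dataset: with 50 objects there are 2^50 - 51 groups of two or more objects (roughly 1.1e15), so enumerating every combination is not practical. Capping the group size, or calling the helper only on the groups of interest, keeps the computation manageable.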