Creating a Similarity Matrix from Raw Card-Sort Data


Problem description

I have a data set from an online card sorting activity. Participants were presented with a random subset of Cards (from a larger set) and asked to create Groups of Cards they felt were similar to one another. Participants were able to create as many Groups as they liked and name the Groups whatever they wanted.

An example data set:

Data <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L), Card = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 2L, 3L, 5L, 7L, 9L, 10L, 11L, 12L, 13L, 14L, 
1L, 3L, 4L, 5L, 6L, 7L, 8L, 12L, 13L, 14L), .Label = c("A", "B", 
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N"), class = "factor"), 
    Group = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 3L, 5L, 2L, 5L, 
    1L, 2L, 1L, 3L, 1L, 4L, 4L, 2L, 3L, 1L, 1L, 2L, 1L, 2L, 3L, 
    2L, 1L, 2L, 2L, 3L), .Label = c("Cat1", "Cat2", "Cat3", "Cat4", 
    "Cat5"), class = "factor")), .Names = c("Subject", "Card", 
"Group"), class = "data.frame", row.names = c(NA, -30L))

From these data I'd like to create a similarity matrix, ideally of proportion or percentage of total counts where items were grouped together.

Something like this:

Counts:

    A   B   C   D   E   F   G   H   I   J   K   L   M   N
A       0   0   1   1   0   0   1   0   0   0   0   0   0
B   0       0   0   1   0   0   0   2   0   0   0   0   1
C   0   0       0   0   1   2   0   0   0   0   2   1   0
D   1   0   0       0   0   0   1   0   0   0   0   0   0
E   1   1   0   0       0   1   0   1   0   0   1   1   1
F   0   0   1   0   0       1   0   0   0   0   0   0   1
G   0   0   2   0   1   1       0   0   0   0   1   2   0
H   1   0   0   1   0   0   0       0   1   0   0   0   0
I   0   2   0   0   1   0   0   0       0   0   0   0   1
J   0   0   0   0   0   0   0   1   0       1   0   0   0
K   0   0   0   0   0   0   0   0   0   1       0   0   0
L   0   0   2   0   1   0   1   0   0   0   0       1   0
M   0   0   1   0   1   0   2   0   0   0   0   1       0
N   0   1   0   0   1   1   0   0   1   0   0   0   0   

Every subject named their Groups differently, so it's not possible to index by Group.
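This is less of an obstacle than it looks: group labels only need to be consistent *within* one subject, so you can count co-membership per subject and sum across subjects without ever comparing labels between subjects. A minimal base-R sketch of that idea (not the accepted answer's code; the `mini` data frame below is invented for illustration):

```r
# Count, per subject, which card pairs share a group, using a card-by-group
# incidence matrix; group names never need to match across subjects.
mini <- data.frame(Subject = c(1, 1, 1, 2, 2, 2),
                   Card    = c("A", "B", "C", "A", "B", "C"),
                   Group   = c("pets", "pets", "wild", "x", "y", "x"))

cards <- sort(unique(mini$Card))
same.group <- matrix(0, length(cards), length(cards),
                     dimnames = list(cards, cards))

for (s in split(mini, mini$Subject)) {
  inc <- table(factor(s$Card, levels = cards), s$Group)  # card x group incidence
  same.group <- same.group + ((inc %*% t(inc)) > 0)      # 1 where two cards share a group
}
diag(same.group) <- NA

same.group["A", "B"]  # -> 1 (A & B were grouped together by 1 of the 2 subjects)
```

The `crossprod`-style product `inc %*% t(inc)` is the standard trick for turning a membership matrix into a co-membership matrix; the `> 0` keeps each subject's contribution to a pair at most 1.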

In addition to counts, I'd also like to generate a similarity matrix that reports, for each pair of Cards, the percentage of participants who were presented with that pair and grouped the two Cards together.

From the example data set, this would be the result:

    A   B   C   D   E   F   G   H   I   J   K   L   M   N
A       0   0   50  50  0   0   50  0   0   0   0   0   0
B   0       0   0   50  0   0   0   100 0   0   0   0   100
C   0   0       0   0   50  67  0   0   0   0   100 50  0
D   50  0   0       0   0   0   50  0   0   0   0   0   0
E   50  50  33  0       0   33  0   50  0   0   33  50  50
F   0   0   50  0   0       50  0   0   0   0   0   0   100
G   0   0   67  0   33  50      0   0   0   0   50  100 0
H   50  0   0   50  0   0   0       0   100 0   0   0   0
I   0   100 0   0   50  0   0   0       0   0   0   0   100
J   0   0   0   0   0   0   0   100 0       100 0   0   0
K   0   0   0   0   0   0   0   0   0   100     0   0   0
L   0   0   100 0   33  0   50  0   0   0   0       50  0
M   0   0   50  0   50  0   100 0   0   0   0   50      0
N   0   100 0   0   50  100 0   0   100 0   0   0   0   

Any suggestions would be greatly appreciated!

Although the answer below works for the example data, it does not seem to work for the actual data posted here: https://www.dropbox.com/s/mhqwyok0nmvt3g9/Sim_Example.csv?dl=0

For example, in those data I manually count 22 pairings of "Aircraft" and "Airport", which would be ~55%. But the answer below yields a count of 12 and 60%.
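One way to pin down which number is right is to tally the pair by hand in code. A hedged sketch, assuming the linked CSV has the same Subject/Card/Group columns as the example data (`check_pair` is a hypothetical helper, not part of the answer below):

```r
# Hypothetical helper: for one pair of cards, count the subjects who were
# shown both cards, and how many of those put them in the same group.
check_pair <- function(d, a, b) {
  per.subj <- sapply(split(d, d$Subject), function(s) {
    ga <- s$Group[s$Card == a]
    gb <- s$Group[s$Card == b]
    if (length(ga) == 0 || length(gb) == 0) NA       # subject never saw both cards
    else as.character(ga[1]) == as.character(gb[1])  # same group for this subject?
  })
  c(seen.together = sum(!is.na(per.subj)),
    same.group    = sum(per.subj, na.rm = TRUE),
    perc          = 100 * sum(per.subj, na.rm = TRUE) / sum(!is.na(per.subj)))
}

# e.g. on the linked data set (file and column names assumed):
# check_pair(read.csv("Sim_Example.csv"), "Aircraft", "Airport")
```

If the manual tally and this helper agree, the discrepancy lies in the pipeline below (for instance, in how pairs from the same subject are de-duplicated) rather than in the data.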

Recommended answer

A solution based on the OP's clarified requirements.

Step 1. Process data to create card pairs & whether they've been grouped together by any user:

library(tidyverse); library(data.table)

Data.matrix <- Data %>% 

  # convert data into list of data frames by subject
  split(Data$Subject) %>%

  # for each subject, we create all pair combinations based on the subset cards he 
  # received, & note down whether he grouped the pair into the same group 
  # (assume INTERNAL group naming consistency. i.e. if subject 1 uses group names such 
  # as "cat", "dog", "rat", they are all named exactly so, & we don't worry about 
  # variations / typos such as "cat1.5", "dgo", etc.)
  lapply(function(x){
    # work on characters: combn() applied to a factor silently returns the
    # underlying integer codes, so convert Card first (and compute the pairs once)
    cards <- as.character(x$Card)
    pairs <- t(combn(cards, 2))
    data.frame(V1 = pairs[, 1],
               V2 = pairs[, 2],
               G1 = x$Group[match(pairs[, 1], cards)],
               G2 = x$Group[match(pairs[, 2], cards)],
               stringsAsFactors = FALSE) %>%
      mutate(co.occurrence = 1,
             same.group = G1 == G2) %>%
      select(-G1, -G2)}) %>%

  # combine the list of data frames back into one, now that we don't worry about group 
  # names, & calculate the proportion of times each pair is assigned the same group, 
  # based on the total number of times they occurred together in any subject's 
  # subset.
  rbindlist() %>%
  rowwise() %>%
  mutate(V1.sorted = min(V1, V2),
         V2.sorted = max(V1, V2)) %>%
  ungroup() %>%
  group_by(V1.sorted, V2.sorted) %>%
  summarise(co.occurrence = sum(co.occurrence),
            same.group = sum(same.group)) %>%
  ungroup() %>%
  rename(V1 = V1.sorted, V2 = V2.sorted) %>%
  mutate(same.group.perc = same.group/co.occurrence * 100) %>%

  # now V1 ranges from A:M, where V2 ranges from B:N. let's complete all combinations
  mutate(V1 = factor(V1, levels = sort(unique(Data$Card))),
         V2 = factor(V2, levels = sort(unique(Data$Card)))) %>%
  complete(V1, V2, fill = list(NA))

> Data.matrix
# A tibble: 196 x 5
       V1     V2 co.occurrence same.group same.group.perc
   <fctr> <fctr>         <dbl>      <int>           <dbl>
 1      A      A            NA         NA              NA
 2      A      B             1          0               0
 3      A      C             2          0               0
 4      A      D             2          1              50
 5      A      E             2          1              50
 6      A      F             2          0               0
 7      A      G             2          0               0
 8      A      H             2          1              50
 9      A      I             1          0               0
10      A      J             1          0               0
# ... with 186 more rows

# same.group is the number of times a card pair has been grouped together.
# same.group.perc is the percentage of users who grouped the card pair together.

Step 2. Create separate matrices for count & percentage:

# spread count / percentage respectively into wide form

Data.count <- Data.matrix %>%
  select(V1, V2, same.group) %>%
  spread(V2, same.group, fill = 0) %>%
  remove_rownames() %>%
  column_to_rownames("V1") %>%
  as.matrix()

Data.perc <- Data.matrix %>%
  select(V1, V2, same.group.perc) %>%
  spread(V2, same.group.perc, fill = 0) %>%
  remove_rownames() %>%
  column_to_rownames("V1") %>%
  as.matrix()

Step 3. Convert the upper triangular matrices into symmetric matrices (note: I've just found a shorter & neater solution here):

# fill up lower triangle to create symmetric matrices
Data.count[lower.tri(Data.count)] <- t(Data.count)[lower.tri(t(Data.count))]
Data.perc[lower.tri(Data.perc)] <- t(Data.perc)[lower.tri(t(Data.perc))]

# ALTERNATE to previous step
Data.count <- pmax(Data.count, t(Data.count))
Data.perc <- pmax(Data.perc, t(Data.perc))

Step 4. Get rid of the diagonals since there's no point pairing a card with itself:

# convert diagonals to NA since you don't really need them
diag(Data.count) <- NA
diag(Data.perc) <- NA

Step 5. Verify the results:

> Data.count
   A  B  C  D  E  F  G  H  I  J  K  L  M  N
A NA  0  0  1  1  0  0  1  0  0  0  0  0  0
B  0 NA  0  0  1  0  0  0  2  0  0  0  0  1
C  0  0 NA  0  1  1  2  0  0  0  0  2  1  0
D  1  0  0 NA  0  0  0  1  0  0  0  0  0  0
E  1  1  1  0 NA  0  1  0  1  0  0  1  1  1
F  0  0  1  0  0 NA  1  0  0  0  0  0  0  1
G  0  0  2  0  1  1 NA  0  0  0  0  1  2  0
H  1  0  0  1  0  0  0 NA  0  1  0  0  0  0
I  0  2  0  0  1  0  0  0 NA  0  0  0  0  1
J  0  0  0  0  0  0  0  1  0 NA  1  0  0  0
K  0  0  0  0  0  0  0  0  0  1 NA  0  0  0
L  0  0  2  0  1  0  1  0  0  0  0 NA  1  0
M  0  0  1  0  1  0  2  0  0  0  0  1 NA  0
N  0  1  0  0  1  1  0  0  1  0  0  0  0 NA

> Data.perc
   A   B   C  D  E   F   G   H   I   J   K   L   M   N
A NA   0   0 50 50   0   0  50   0   0   0   0   0   0
B  0  NA   0  0 50   0   0   0 100   0   0   0   0 100
C  0   0  NA  0 33  50  67   0   0   0   0 100  50   0
D 50   0   0 NA  0   0   0  50   0   0   0   0   0   0
E 50  50  33  0 NA   0  33   0  50   0   0  50  50  50
F  0   0  50  0  0  NA  50   0   0   0   0   0   0 100
G  0   0  67  0 33  50  NA   0   0   0   0  50 100   0
H 50   0   0 50  0   0   0  NA   0 100   0   0   0   0
I  0 100   0  0 50   0   0   0  NA   0   0   0   0 100
J  0   0   0  0  0   0   0 100   0  NA 100   0   0   0
K  0   0   0  0  0   0   0   0   0 100  NA   0   0   0
L  0   0 100  0 50   0  50   0   0   0   0  NA  50   0
M  0   0  50  0 50   0 100   0   0   0   0  50  NA   0
N  0 100   0  0 50 100   0   0 100   0   0   0   0  NA
