How to calculate a table of pairwise counts from a long-form data frame
Question
I have a 'long-form' data frame with columns id (the primary key) and featureCode (a categorical variable). Each record has between 1 and 9 values of the categorical variable. For example:
id featureCode
5 PPLC
5 PCLI
6 PPLC
6 PCLI
7 PPL
7 PPLC
7 PCLI
8 PPLC
9 PPLC
10 PPLC
I'd like to calculate the number of times each feature code is used together with each of the other feature codes (the "pairwise counts" of the title). At this stage, the order in which the feature codes are used is not important. I envisage the result would be another data frame, where the rows and columns are feature codes and the cells are counts. For example:
     PPLC PCLI PPL
PPLC    0    3   1
PCLI    3    0   1
PPL     1    1   0
Unfortunately, I don't know how to perform this calculation, and I've drawn a blank when searching for advice (mostly, I suspect, because I don't know the correct terminology).
Recommended answer
Here is a data.table approach similar to @mrdwab's. It applies if featureCode is a character column:
library(data.table)
# `dat` is the long-form data frame from the question
DT <- data.table(dat)
# convert to character
DT[, featureCode := as.character(featureCode)]
# subset those with >1 per id
DT2 <- DT[, N := .N, by = id][N > 1]
# create all combinations of 2
# return as a data.table with these as columns `V1` and `V2`
# then count the numbers in each group
DT2[, rbindlist(combn(featureCode, 2,
                      FUN = function(x) as.data.table(as.list(x)),
                      simplify = FALSE)),
    by = id][, .N, by = list(V1, V2)]
     V1   V2 N
1: PPLC PCLI 3
2:  PPL PPLC 1
3:  PPL PCLI 1
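The output above lists each pair once, in the order the codes happen to appear within an id, whereas the question asks for a square table with feature codes as both rows and columns. As a minimal sketch of one way to get that shape directly, the following uses base R's table() and crossprod() rather than data.table. This is a different technique from the answer's, and the names `incidence` and `pair_counts` are illustrative only. It assumes each (id, featureCode) combination occurs at most once, as in the example data.

```r
# Example data from the question; the name `dat` matches the answer's code.
dat <- data.frame(
  id = c(5, 5, 6, 6, 7, 7, 7, 8, 9, 10),
  featureCode = c("PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                  "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"),
  stringsAsFactors = FALSE
)

# Incidence table: one row per id, one column per feature code.
# Cells are 0/1 provided each (id, featureCode) pair occurs only once.
incidence <- table(dat$id, dat$featureCode)

# t(incidence) %*% incidence: cell [a, b] counts the ids that carry both a and b.
pair_counts <- crossprod(incidence)

# The diagonal holds each code's own frequency; zero it to match the question.
diag(pair_counts) <- 0

pair_counts
```

For the example data this gives 3 for the PPLC/PCLI cell and 1 for the two cells involving PPL, in agreement with the table in the question. The result is symmetric, so the order of codes within an id does not matter. Rows and columns come out in alphabetical order, and the result is a matrix, so as.data.frame.matrix(pair_counts) may be needed if a data frame is required.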