Finding co-occurrences between two variables within a group


Problem description


I am hoping to efficiently compute a co-occurrence matrix by finding the co-occurrences between two different variables within a group, ideally without using a complex loop that iterates through all possible combinations.

Given that my dataframe looks as follows:

df = data.frame(group = c(1,1,1,2,2,2),var1 = c(1,2,4,2,2,4),var2 = c(4,1,2,1,3,2))

> df
  group var1 var2
1     1    1    4
2     1    2    1
3     1    4    2
4     2    2    1
5     2    2    3
6     2    4    2

I am hoping to turn this into a new co-occurrence matrix, where the rows represent var1 and the columns var2.

EDIT: For those unfamiliar with co-occurrences, I am interested in pairs of values that occur together in a group. For example, the combination of "2" and "1" happens once in group 1 and once more in group 2, giving 2 co-occurrences. In my example, I put the combinations next to each other, but they could occur anywhere within the group.

It should look like the following:

> cooc
  1 2 3 4
1 0 2 0 1
2 2 0 1 2
3 0 1 0 0
4 1 2 0 0

I have done this before when dealing with co-occurrences of just one variable within a group by using the xtabs function, but I am not sure how to apply it to multiple columns. For example, if I were interested in finding the co-occurrences for var1 within the different groups, I would do the following:

> td = xtabs(~group + var1,data = df)
> cooc = crossprod(td,td)
> diag(cooc) = 0
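
For reference, the single-variable approach above can be run as a self-contained sketch on the sample data. The name `cooc_var1` is introduced here only for illustration. Note that crossprod multiplies the per-group counts, so a value that appears twice in a group contributes twice to each of its pairs.

```r
# Sample data from the question
df <- data.frame(group = c(1,1,1,2,2,2),
                 var1  = c(1,2,4,2,2,4),
                 var2  = c(4,1,2,1,3,2))

# Rows are groups, columns are var1 values, cells are counts
td <- xtabs(~ group + var1, data = df)

# t(td) %*% td: for each pair of var1 values, sum over groups
# the product of their counts within the group
cooc_var1 <- crossprod(td, td)

# A value trivially co-occurs with itself, so clear the diagonal
diag(cooc_var1) <- 0

cooc_var1
```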

Solution

If I am understanding your question correctly, I believe this should work:

# i only use data.table here in case we need to do this "by group"
# but in this solution I do not use it as i did not see the significance
# of grouping
###library(data.table)
###df <-  data.table(df)

# this creates the pair of values "a_b"
df$ID <- paste(df$var1,df$var2,sep="_")
# we enumerate all the unique values so that we can create
# a lookup grid to match the data against later
uniqval <- sort(unique(c(df$var1,df$var2)))
grid <- expand.grid(uniqval,uniqval)
grid$ID <- paste(grid$Var1,grid$Var2,sep="_")
# match our data to this map
matches <- sort(match(df$ID,grid$ID))
# tabulate our results into a dataframe
tab <- data.frame(table(grid$ID[matches]))
# split up our ID back into values
# (substr assumes every value is a single character, as in this example)
tab$Var2 <- substr(tab$Var1,3,3)
tab$Var1 <- substr(tab$Var1,1,1)
# create our empty result matrix
cooc <- matrix(0,nrow=length(uniqval),ncol=length(uniqval))
rownames(cooc) <- uniqval
colnames(cooc) <- uniqval

# there are other ways to do this
# but this seemed simple enough of a loop for me
# we just need to replace the tabulation results
# into our desired location in the matrix
# namely, "a_b" frequencies into [a,b] and [b,a] positions
for(m in seq_len(nrow(tab))){

  i <- tab$Var1[m]
  j <- tab$Var2[m]

  # by adding this to the previous value
  # we are accounting for "a_b" equiv. to "b_a"
  cooc[i,j] <- cooc[i,j]+tab$Freq[m]
  cooc[j,i] <- cooc[i,j]

}
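
As the comments above note, there are other ways to do this. Below is a minimal loop-free sketch. It assumes, as the answer does, that each row's var1/var2 pair is what counts as a co-occurrence and that "a_b" is equivalent to "b_a". It cross-tabulates the two columns over a shared set of levels and adds the transpose. The names `pair_counts` and `cooc_alt` are introduced here for illustration and are not part of the original answer.

```r
# Sample data from the question
df <- data.frame(group = c(1,1,1,2,2,2),
                 var1  = c(1,2,4,2,2,4),
                 var2  = c(4,1,2,1,3,2))

# Shared levels so that rows and columns cover the same values
uniqval <- sort(unique(c(df$var1, df$var2)))

# Count each ordered (var1, var2) pair, then convert the table to a plain matrix
pair_counts <- table(factor(df$var1, levels = uniqval),
                     factor(df$var2, levels = uniqval))
pair_counts <- matrix(pair_counts, nrow = length(uniqval),
                      dimnames = list(as.character(uniqval),
                                      as.character(uniqval)))

# Treat "a_b" and "b_a" as the same pair by adding the transpose
cooc_alt <- pair_counts + t(pair_counts)

# Pairs of identical values ("a_a") would otherwise be counted twice
diag(cooc_alt) <- diag(pair_counts)

cooc_alt
```

On the sample data this should give the same matrix as the one requested in the question, and it does not depend on the values being single characters.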
