Simulating co-occurrence data in R

Problem description

I am trying to create a data set of co-occurrence data where the variable of interest is a software application. I want to simulate an n by n matrix where each cell holds the number of times application A was used together with application B. How can I create a data set in R that I can use to test a set of clustering and partitioning algorithms? What model would I use, and how would I generate the data in R?

Solution

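set.seed(42)
# Note: the output shown below matches R < 3.6.0; newer versions use a
# different default sampling algorithm, so the draws will differ.

# software names:
software <- c("a", "b", "c", "d")
# number of times each software was used:
times.each.sw <- c(5, 10, 12, 3)

# co-occurrence data.frame: one row per unordered pair of software
swdf <- setNames(data.frame(t(combn(software, 2))), c("sw1", "sw2"))
# for each pair, draw a count between 1 and the usage count of the less-used member
swdf$freq.cooc <- apply(combn(times.each.sw, 2), 2, function(x) sample(1:min(x), 1))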
#  sw1 sw2 freq.cooc
#1   a   b         5
#2   a   c         5
#3   a   d         1
#4   b   c         9
#5   b   d         2
#6   c   d         2

If you prefer a co-occurrence matrix, something like this may work:

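# usage counts on the diagonal
mat <- diag(times.each.sw)
dimnames(mat) <- list(software, software)
# combn() lists the pairs in the same column-major order as lower.tri()
mat[lower.tri(mat)] <- swdf$freq.cooc
# mirror the lower triangle to make the matrix symmetric
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]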

#  a  b  c d
#a 5  5  5 1
#b 5 10  9 2
#c 5  9 12 2
#d 1  2  2 3

The diagonal contains the number of times each software was used (i.e. used with itself). The lower and upper triangles contain the number of times each pair was used together, which must always be less than or equal to the number of times the less frequently used member of the pair was used.
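
To get an n by n matrix for any number of applications, the same steps could be wrapped in a function. This is only a minimal sketch under the same uniform-draw model; `simulate_cooc`, `times_used` and `app_names` are hypothetical names that do not appear in the answer above. The checks at the end verify the properties just described: the matrix is symmetric, and no cell exceeds the usage count of the less-used member of the pair.

```r
# Hypothetical helper (not part of the original answer): build a symmetric
# n-by-n co-occurrence matrix from a vector of per-application usage counts.
simulate_cooc <- function(times_used,
                          app_names = paste0("app", seq_along(times_used))) {
  stopifnot(length(times_used) >= 2,
            all(times_used >= 1),
            length(app_names) == length(times_used))

  # Usage counts go on the diagonal.
  mat <- diag(times_used)
  dimnames(mat) <- list(app_names, app_names)

  # One draw per unordered pair, between 1 and the smaller usage count.
  pair_draws <- apply(combn(times_used, 2), 2,
                      function(x) sample.int(min(x), 1))

  # combn() enumerates pairs in the same column-major order as lower.tri().
  mat[lower.tri(mat)] <- pair_draws
  mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
  mat
}

set.seed(1)
usage <- c(5, 10, 12, 3, 8, 20)  # assumed example counts
m <- simulate_cooc(usage)

# Each cell is bounded by the smaller of the two diagonal entries.
upper_bound <- outer(diag(m), diag(m), pmin)
stopifnot(isSymmetric(m))
stopifnot(all(m <= upper_bound))
stopifnot(all(m[lower.tri(m)] >= 1))
```

Because each pair is drawn independently, the counts are not guaranteed to be jointly consistent with any single underlying usage log, and no pair ever gets a count of zero. That may be acceptable for exercising clustering code, but it is a limitation of this simple model.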
