Generating a co-occurrence matrix in R on a LARGE dataset


Problem description

I'm trying to create a co-occurrence matrix in R on a very large dataset (26M lines) that looks basically like this:

ID        Observation
11000     ficus
11112     cherry
11112     ficus
12223     juniper
12223     olive
12223     juniper
12223     ficus
12334     olive
12334     cherry
12334     olive
...       ...

And so on for a long time. I want to consolidate the observations by ID and generate a co-occurrence matrix of the observations recorded by each observer ID. I managed this on a subset of the data, but I did some of the steps "manually", which would not be practical for the entire set (see code below). I am resisting the urge to use loops because everyone tells me that if you use a loop in R you're doing it wrong, but I'm not sure how else to achieve this.

Here's what worked for the smaller set:

tfmat = data.frame(cmsclip$ID[1:100],
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[1]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[2]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[3]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[4]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[5]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[6]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[7]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[9]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[10]),
                   cmsclip$tree[1:100] %in% c(cmsclip$tree[11]))
colnames(tfmat) <- c('ID', cmsclip$tree[1:7], cmsclip$tree[9:11])

I just picked the first 10 unique tree names, which in the full block of data happened to be cmsclip$tree[1:7] and cmsclip$tree[9:11], and looked only at those in the first 100 observations, generating a TRUE/FALSE matrix for each tree on each line. Then I used split() to consolidate the data by ID number and summed each column (now representing a specific tree) to see how many of that tree each observer recorded.

testsplit = split(tfmat, tfmat[1])
summed1 <-colSums(testsplit$`ficus`)
summed2 <-colSums(testsplit$`cherry`)
summed3 <-colSums(testsplit$`juniper`)
summed4 <-colSums(testsplit$`pine`)
summed5 <-colSums(testsplit$`olive`)
summed6 <-colSums(testsplit$`elm`)
summed7 <-colSums(testsplit$`rain`)
summed8 <-colSums(testsplit$`redwood`)
summed9 <-colSums(testsplit$`shimpaku`)
summed10 <-colSums(testsplit$`maple`)

The problem with this is that I typed each name in by hand, and I can't do that with the whole data frame. This is where I'm looking for something better. I then combined the sums into a final matrix that has the number of observations of each tree in a given column, where one line represents one ID number.

finmat = data.frame(summed1[2:11],summed2[2:11],summed3[2:11],
                summed4[2:11],summed5[2:11],summed6[2:11],
                summed7[2:11],summed8[2:11],summed9[2:11],
                summed10[2:11],summed11[2:11])

Then I did this:

finmat <- t(finmat)
treenames <- c(cmsclip$tree[1:7], cmsclip$tree[9:11])
colnames(finmat) <- treenames

total_occurrences <- colSums(finmat)
data_matrix <- as.matrix(finmat)
co_occurrence <- t(data_matrix) %*% data_matrix
library(igraph)
graph <- graph.adjacency(co_occurrence,
                         weighted=TRUE,
                         mode="undirected",
                         diag=FALSE)

plot(graph,
      vertex.label=names(data),
      vertex.size=total_occurrences*5,
      edge.width=E(graph)$weight*8)

It worked fine; it's just not scalable to a much larger set of data (too much manual work). Any suggestions for a faster way would be appreciated. Thanks!

Ideally I would like to create a matrix where the rows correspond to the ID numbers and the columns to each individual tree, and it would show how many of each tree was observed. e.g.:

ID       ficus   cherry   juniper   olive   ...
11000    1       0        0         0
11112    1       1        0         0
12223    1       0        2         1
12334    0       1        0         2
...
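
One possible way to get an ID-by-tree count matrix like this without typing any names by hand is a cross-tabulation. The following is only a minimal sketch on toy data: the data frame `cmsclip` and its columns `ID` and `tree` follow the names used in the question's code, and the values are the sample rows shown earlier. For 26M rows a dense table may not fit in memory; `xtabs(~ ID + tree, data = cmsclip, sparse = TRUE)` (which needs the Matrix package) is one option worth trying in that case.

```r
# Toy data mirroring the sample rows in the question
# (column names ID and tree are assumed from the question's code)
cmsclip <- data.frame(
  ID   = c(11000, 11112, 11112, 12223, 12223, 12223, 12223, 12334, 12334, 12334),
  tree = c("ficus", "cherry", "ficus", "juniper", "olive", "juniper", "ficus",
           "olive", "cherry", "olive")
)

# Cross-tabulate: one row per ID, one column per tree, cells are counts
counts <- table(cmsclip$ID, cmsclip$tree)

# Drop the "table" class to get a plain integer matrix with dimnames
counts <- unclass(counts)
print(counts)
```

The columns come out in alphabetical order of the tree names, and every distinct tree in the data gets a column automatically.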

I should add that I actually ultimately want to do hierarchical clustering on the data using the co-occurrence of observations of the different trees as a way to calculate the "distance" between trees. So any suggestions on how to achieve that would be great. And I could potentially abandon generating the above matrix if there is an easy way to jump from what I have to a distance matrix.
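
As a rough sketch of how a count matrix could be turned into a distance matrix for hierarchical clustering: the example below converts counts to presence/absence, uses `crossprod()` to count shared observers per tree pair, and derives a Jaccard distance. The choice of Jaccard (and of average linkage) is an assumption made for illustration, not something the question specifies; other similarity measures would slot in the same way.

```r
# Small ID-by-tree count matrix (same toy values as the example table above)
counts <- matrix(
  c(1, 0, 0, 0,
    1, 1, 0, 0,
    1, 0, 2, 1,
    0, 1, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("11000", "11112", "12223", "12334"),
                  c("ficus", "cherry", "juniper", "olive"))
)

# 1 if the observer recorded the tree at all, 0 otherwise
present <- (counts > 0) * 1

# t(present) %*% present: number of observers shared by each pair of trees
co_occurrence <- crossprod(present)

# Jaccard similarity: shared observers / observers who saw either tree
n_obs   <- diag(co_occurrence)
union_n <- outer(n_obs, n_obs, "+") - co_occurrence
jaccard <- co_occurrence / union_n

# Convert similarity to a distance object and cluster the trees
tree_dist <- as.dist(1 - jaccard)
hc <- hclust(tree_dist, method = "average")
```

Calling `plot(hc)` would then draw the dendrogram. Note that this clusters the trees (columns), so the distance matrix has one row per distinct tree rather than one per ID.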

Recommended answer

Try this:

tapply(cmsclips$ID, cmsclips$Observation, FUN = c)

Here is an example:

test = data.frame(id = c(11,12,13,14), obs=c("cat", "dog", "cat", "cat"))
#  id obs
#1 11 cat
#2 12 dog
#3 13 cat
#4 14 cat
tapply(test$id, test$obs, FUN = c)
#$cat
#[1] 11 13 14
#
#$dog
#[1] 12
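
The grouped list above is not yet a co-occurrence matrix. One way it might be taken a step further, sketched here under the assumption that two observations "co-occur" when the same ID recorded both, is to count the IDs shared by each pair. The toy data differ slightly from the example above so that one ID appears under both labels, and `unique` replaces `c` so repeat sightings count once.

```r
# Toy data: ID 12 recorded both "dog" and "cat"
test <- data.frame(id  = c(11, 12, 12, 13, 14),
                   obs = c("cat", "dog", "cat", "cat", "dog"))

# IDs grouped by observation, as in the answer (unique() drops repeat sightings)
ids_by_obs <- tapply(test$id, test$obs, FUN = unique)

# Count the IDs shared by every pair of observations
labels <- names(ids_by_obs)
shared <- sapply(labels, function(a)
  sapply(labels, function(b)
    length(intersect(ids_by_obs[[a]], ids_by_obs[[b]]))))
print(shared)
```

The pairwise loop grows with the square of the number of distinct observations, so this suits a modest number of tree names; the diagonal holds the number of IDs that recorded each observation.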
