Venn diagram from list of clusters and co-occurring factors

Question

I've got an input file with a list of ~50000 clusters and presence of a number of factors in each of them (~10 million entries in total), see a smaller example below:

set.seed(1)
x = paste("cluster-",sample(c(1:100),500,replace=TRUE),sep="")
y = c(
  paste("factor-",sample(c(letters[1:3]),300, replace=TRUE),sep=""),
  paste("factor-",sample(c(letters[1]),100, replace=TRUE),sep=""),
  paste("factor-",sample(c(letters[2]),50, replace=TRUE),sep=""),
  paste("factor-",sample(c(letters[3]),50, replace=TRUE),sep="")
)
data = data.frame(cluster=x,factor=y)

With a bit of help from another question, I got it to produce a piechart for co-occurrence of factors like this:

counts = with(data, table(tapply(factor, cluster, function(x) paste(as.character(sort(unique(x))), collapse='+'))))
pie(counts[counts>1])

But now I would like to have a Venn diagram for the co-occurrence of factors. Ideally, it would also take a threshold for the minimum count of each factor. For example, a Venn diagram in which a factor has to be present more than 10 times (n > 10) in a cluster to be taken into account for that cluster.

I've tried to find a way to produce the table counts with aggregate, but couldn't make it work.

Recommended answer

I've provided two solutions, using two different packages with Venn diagram capabilities. As you expected, both involve an initial step using the aggregate() function.

I tend to prefer the results from the venneuler package. Its default label positions aren't ideal, but you could adjust them by having a look at the associated plot method (possibly using locator() to select the coordinates).

Solution one:

One possibility is to use venneuler() in the venneuler package to draw your Venn diagram.

library(venneuler)

## Strip the "factor-" prefix so the labels are just "a", "b" and "c",
## and make sure the column is a character vector. (Using sub() works
## whether or not data.frame() stored the column as a factor.)
data$factor <- sub("^factor-", "", as.character(data$factor))

## FUN is an anonymous function that determines which letters are present
## 2 or more times in the cluster and then pastes them together into 
## strings of a form that venneuler() expects.
##
inter <- aggregate(factor ~ cluster, data=data,
                   FUN = function(X) {
                       tab <- table(X)
                       names <- names(tab[tab>=2])
                       paste(sort(names), collapse="&")
                   })            
## Count how many clusters contain each combination of letters
counts <- table(inter$factor)
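counts <- counts[names(counts)!=""]  # To remove groups with <2 of any letter
## Output from the original run (exact counts depend on the R version,
## because sample() changed its default algorithm in R 3.6.0):
#  a   a&b a&b&c   a&c     b   b&c     c 
# 19    13    12    14    13     9    12 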

## Convert to proportions for venneuler()
ps <- counts/sum(counts)

## Calculate the Venn diagram
vd <- venneuler(c(a=ps[["a"]], b = ps[["b"]], c = ps[["c"]],
                  "a&b" = ps[["a&b"]],
                  "a&c" = ps[["a&c"]],
                  "b&c" = ps[["b&c"]],
                  "a&b&c" = ps[["a&b&c"]]))
## Plot it!
plot(vd)

A few notes about choices I made in writing this code:


  • I've changed the names of the factors from "factor-a" to "a". You can obviously change that back.

  • I've only required each factor to be present >=2 times (instead of >10) to be counted within each cluster. (That was to demonstrate the code with this small subset of your data.)

  • If you take a look at the intermediate object counts before the empty name is filtered out, you'll see that it contains an initial unnamed element. That element is the number of clusters in which no letter appears 2 or more times. You can decide better than I whether or not you want to include those in the calculation of the subsequent ps ('proportions') object.
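Since the question asks for an adjustable threshold, here is a minimal sketch of how the counting step could be wrapped in a function. The helper combo_counts() and its min_n argument are names made up for this illustration, not part of venneuler or any other package, and it uses table() instead of aggregate(), so treat it as one possible variant rather than the code above. The example data has the same shape as the data in the question.

```r
## Example data in the same shape as the question's (values depend on R's RNG)
set.seed(1)
x <- paste0("cluster-", sample(1:100, 500, replace = TRUE))
y <- c(
  paste0("factor-", sample(letters[1:3], 300, replace = TRUE)),
  rep("factor-a", 100),
  rep("factor-b", 50),
  rep("factor-c", 50)
)
data <- data.frame(cluster = x, factor = y, stringsAsFactors = FALSE)
data$factor <- sub("^factor-", "", data$factor)   # "factor-a" -> "a"

## Hypothetical helper: for each cluster keep the factors seen at least
## `min_n` times, then count how many clusters show each combination.
combo_counts <- function(data, min_n = 2) {
  tab  <- unclass(table(data$cluster, data$factor))  # clusters x factors
  keep <- tab >= min_n                               # logical membership
  combo <- apply(keep, 1, function(row) {
    paste(sort(colnames(keep)[row]), collapse = "&")
  })
  combo  <- combo[combo != ""]   # drop clusters with no qualifying factor
  counts <- table(combo)
  setNames(as.numeric(counts), names(counts))
}

counts <- combo_counts(data, min_n = 2)
ps <- counts / sum(counts)       # proportions, named by combination

## Passing the whole named vector avoids hard-coding each combination.
## Guarded so the sketch still runs where venneuler (which needs rJava)
## is not installed.
if (requireNamespace("venneuler", quietly = TRUE)) {
  plot(venneuler::venneuler(ps))
}
```

With the real data you would call combo_counts(data, min_n = 11) to require n > 10. Handing the named vector to venneuler() in one piece also sidesteps the error that ps[["b&c"]] would raise if some combination never occurs.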

Solution two:

Another possibility is to employ vennCounts() and vennDiagram() in the Bioconductor package limma. To download the package, follow the instructions here. Unlike the venneuler solution above, the overlap in the resultant diagram is not proportional to the actual degree of intersection. Instead, it annotates the diagram with the actual frequencies. (Note that this solution does not involve any edits to the data$factor column.)

library(limma)

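## table() only reports every level for each cluster when the column is a
## factor, and data.frame() no longer creates factors by default (R >= 4.0.0).
## Converting a copy leaves the data$factor column itself untouched.
d <- data
d$factor <- factor(d$factor)

out <- aggregate(factor ~ cluster, data=d, FUN=table)
out <- cbind(out[1], data.frame(out[2][[1]]))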

counts <- vennCounts(out[, -1] >= 2)
vennDiagram(counts, names = c("Factor A", "Factor B", "Factor C"),
            cex = 1, counts.col = "red")
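To see what kind of frequencies end up annotated on the diagram, the following base-R sketch tabulates how many clusters fall into each in/out pattern of the three factors. It only illustrates the idea and is not limma's implementation; the names member, patterns, key and min_n are made up for this example, and the data is simulated in the same shape as the question's.

```r
## Simulated data in the same shape as the question's example
set.seed(1)
data <- data.frame(
  cluster = paste0("cluster-", sample(1:100, 500, replace = TRUE)),
  factor  = paste0("factor-", sample(letters[1:3], 500, replace = TRUE)),
  stringsAsFactors = FALSE
)

min_n <- 2   # a factor counts as present in a cluster if seen >= min_n times

## clusters x factors matrix of TRUE/FALSE
member <- unclass(table(data$cluster, data$factor)) >= min_n

## Every possible in/out pattern for the factors (2^3 = 8 rows)
patterns <- expand.grid(rep(list(c(FALSE, TRUE)), ncol(member)))
names(patterns) <- colnames(member)

## Turn each row into a string such as "101", then count matching clusters
key <- function(m) apply(m, 1, function(r) paste(as.integer(r), collapse = ""))
patterns$Counts <- as.integer(
  table(factor(key(member), levels = key(as.matrix(patterns))))
)
patterns
```

The all-FALSE row holds the clusters in which no factor reaches the threshold, so, unlike the proportions used in solution one, these counts always add up to the total number of clusters.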
