Create an "index" for each element of a group with data.table


Problem description


My data is grouped by the IDs in V6 and ordered by position (V1:V3):

dt
      V1      V2      V3 V4 V5                 V6
 1: chr1 3054233 3054733  .  + ENSMUSG00000090025
 2: chr1 3102016 3102125  .  + ENSMUSG00000064842
 3: chr1 3205901 3207317  .  - ENSMUSG00000051951
 4: chr1 3206523 3207317  .  - ENSMUSG00000051951
 5: chr1 3213439 3215632  .  - ENSMUSG00000051951
 6: chr1 3213609 3216344  .  - ENSMUSG00000051951
 7: chr1 3214482 3216968  .  - ENSMUSG00000051951
 8: chr1 3421702 3421901  .  - ENSMUSG00000051951
 9: chr1 3466587 3466687  .  + ENSMUSG00000089699
10: chr1 3513405 3513553  .  + ENSMUSG00000089699

What I would like to do is add an extra column with an index by position: within each group in V6, the first element would be "1", the second "2", and so on. I can achieve that using ddply and a custom function:

rankExons <- function(x){
   if(unique(x$V5) == "+"){ 
         x$index <- seq_len(nrow(x))}
   else{
         x$index <- rev(seq_len(nrow(x)))}
   x
}

indexed <- ddply(dt, .(V6), rankExons)
indexed
     V1      V2      V3 V4 V5                 V6 index
1  chr1 3205901 3207317  .  - ENSMUSG00000051951     6
2  chr1 3206523 3207317  .  - ENSMUSG00000051951     5
3  chr1 3213439 3215632  .  - ENSMUSG00000051951     4
4  chr1 3213609 3216344  .  - ENSMUSG00000051951     3
5  chr1 3214482 3216968  .  - ENSMUSG00000051951     2
6  chr1 3421702 3421901  .  - ENSMUSG00000051951     1
7  chr1 3102016 3102125  .  + ENSMUSG00000064842     1
8  chr1 3466587 3466687  .  + ENSMUSG00000089699     1
9  chr1 3513405 3513553  .  + ENSMUSG00000089699     2
10 chr1 3054233 3054733  .  + ENSMUSG00000090025     1

Unfortunately, it is extremely slow on the full dataset (~620k rows), and when run in parallel it crashes and burns:

library(doMC)
registerDoMC(cores=6)
indexed <- ddply(dt, .(V6), rankExons, .parallel=TRUE)
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Error: serialization is too large to store in a raw vector
Warning message:
In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed,  :
  all scheduled cores encountered errors in user code

So I went for data.table but couldn't get it working. Here is what I tried:

setkey(dt, "V6")

dt[,index:=rankExons(dt), by=V6]
dt[,rankExons(.sd), by=V6, .SDcols=c("V5, V6")]

And both failed. How can I recreate my ddply with data.table?

dput(dt)
structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1"), V2 = c(3054233L, 3102016L, 
3205901L, 3206523L, 3213439L, 3213609L, 3214482L, 3421702L, 3466587L, 
3513405L), V3 = c(3054733L, 3102125L, 3207317L, 3207317L, 3215632L, 
3216344L, 3216968L, 3421901L, 3466687L, 3513553L), V4 = c(".", 
".", ".", ".", ".", ".", ".", ".", ".", "."), V5 = c("+", "+", 
"-", "-", "-", "-", "-", "-", "+", "+"), V6 = c("ENSMUSG00000090025", 
"ENSMUSG00000064842", "ENSMUSG00000051951", "ENSMUSG00000051951", 
"ENSMUSG00000051951", "ENSMUSG00000051951", "ENSMUSG00000051951", 
"ENSMUSG00000051951", "ENSMUSG00000089699", "ENSMUSG00000089699"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), class = c("data.table", 
"data.frame"), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x1de6a88>)

Solution

As a fellow bioinformatician, I come across this operation quite frequently. This is where I adore data.table's ability to modify a subset of rows by reference!

I'd do it like this:

dt[V5 == "+", index := 1:.N, by=V6]
dt[V5 == "-", index := .N:1, by=V6]

No function is required. This has a small advantage: it avoids checking V5 == "+" or "-" once for every group. Instead, you subset all rows with "+" once, then group by V6 and modify just those rows in place.

Similarly you do it once again for "-". Hope that helps.
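For anyone who wants to try this, here is a minimal sketch that rebuilds the sample table from the question's dput() output and applies the two statements above. The data.table() call is a hand-written reconstruction, not the asker's original object, and setkey() is deliberately left out so the rows stay in their printed order.

```r
library(data.table)

# Sample data rebuilt by hand from the dput() output in the question
dt <- data.table(
  V1 = rep("chr1", 10),
  V2 = c(3054233L, 3102016L, 3205901L, 3206523L, 3213439L,
         3213609L, 3214482L, 3421702L, 3466587L, 3513405L),
  V3 = c(3054733L, 3102125L, 3207317L, 3207317L, 3215632L,
         3216344L, 3216968L, 3421901L, 3466687L, 3513553L),
  V4 = rep(".", 10),
  V5 = c("+", "+", "-", "-", "-", "-", "-", "-", "+", "+"),
  V6 = c("ENSMUSG00000090025", "ENSMUSG00000064842",
         rep("ENSMUSG00000051951", 6),
         rep("ENSMUSG00000089699", 2))
)

# Rows with V5 == "+": count up within each V6 group
dt[V5 == "+", index := 1:.N, by = V6]
# Rows with V5 == "-": count down within each V6 group
dt[V5 == "-", index := .N:1, by = V6]

print(dt)
```

The first statement creates the index column and leaves it NA for the "-" rows; the second fills those in. Both update dt in place, so no reassignment is needed.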

Note: .N is a special variable that contains the number of observations per group.
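To make .N concrete, here is a small sketch on a made-up table (toy and its values are hypothetical, not from the question). It also shows, for comparison, a single grouped pass that performs the V5 check once per group, which is the pattern the two-statement approach avoids. Like the asker's rankExons, it assumes each V6 group contains only one V5 value.

```r
library(data.table)

# Hypothetical toy data: group "a" is all "+", group "b" is all "-"
toy <- data.table(
  V5 = c("+", "+", "-", "-", "-"),
  V6 = c("a", "a", "b", "b", "b")
)

# .N evaluates to the number of rows in the current group
toy[, n := .N, by = V6]

# One grouped pass: the if() runs once for every V6 group
toy[, index := if (V5[1] == "+") seq_len(.N) else rev(seq_len(.N)), by = V6]

print(toy)
```

With many small groups, the per-group if() is evaluated many times, which is why subsetting on V5 first can be preferable.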
