Calculate the frequency of strings separated by a delimiter in different sections of a TSV file


Question

I have a data frame mydf in which the genes in the LeftGene and RightGene columns are separated by ':'. I need to count the number of occurrences of each gene in the LeftGene and RightGene columns per file, and get something like the result shown below. What would be the best way to do this in R?

sample     LeftGene    RightGene
file1
           ATT:TAA
           ATT:ATT      ATT
file2      
           TTP:TTG      TTP:TTP

Result

file1
LeftGene           RightGene
ATT=3              ATT=1
TAA=1

file2
LeftGene           RightGene
TTP=1              TTP=2
TTG=1

To akrun:

Here is the dput output of the actual data. It has a files_name column, and I need the frequency of the genes in Left.Gene.Symbols and Right.Gene.Symbols for each file. I would also like to see the frequency of these genes across all files (cumulative). Thank you for your help.

mydf<-structure(c("AMLM12001KP", NA, "1114002", NA, NA, NA, NA, NA, 
"1121501", NA, NA, NA, "NA", "NA", "NA", "NA", "CR1L", "GIGYF2:GIGYF2:GIGYF2:ENPP3", 
"NA", "NA", "NA", "NA", "NTNG1:NTNG1:ENPP3", "NA", "NA", "NA", 
"NA", "NA", "CDC27:CDC27", "NA", "ENPP3", "NA", "NA", "NA", "NA", 
"NA"), .Dim = c(12L, 3L), .Dimnames = list(NULL, c("files_name", 
"Left.Gene.Symbols", "Right.Gene.Symbols")))

Expected output:

AMLM12001KP
Left.Gene.Symbols       Right.Gene.Symbols

1114002
Left.Gene.Symbols       Right.Gene.Symbols
CR1L=1                  CDC27=2
GIGYF2=3                ENPP3=1
ENPP3=1

1121501
Left.Gene.Symbols       Right.Gene.Symbols
NTNG1=2 
ENPP3=1

All files
Left.Gene.Symbols       Right.Gene.Symbols
CR1L=1                  CDC27=2
GIGYF2=3                ENPP3=1
NTNG1=2 
ENPP3=2

Recommended answer

Edit

dd2<-structure(c("AMLM12001KP", NA, "1114002", NA, NA, NA, NA, NA,"1121501", NA, NA, NA, "NA", "NA", "NA", "NA", "CR1L", "GIGYF2:GIGYF2:GIGYF2:ENPP3","NA", "NA", "NA", "NA", "NTNG1:NTNG1:ENPP3", "NA", "NA", "NA","NA", "NA", "CDC27:CDC27", "NA", "ENPP3", "NA", "NA", "NA", "NA", "NA"), .Dim = c(12L, 3L), .Dimnames = list(NULL, c("files_name", "Left.Gene.Symbols", "Right.Gene.Symbols")))


## change character NAs to <NA> and carry-forward the file column
dd2[dd2 == 'NA'] <- NA
dd2[, 1] <- na.omit(unique(dd2[, 1]))[cumsum(!is.na(dd2[, 1]))]

## split based on file name
sp <- split(data.frame(dd2, stringsAsFactors = FALSE), dd2[, 1])

## split each string by `:` and make a table
(l <- lapply(sp, function(x) {
  x <- droplevels(x[, -1])
  f <- function(x) na.omit(unlist(strsplit(x, ':')))
  left <- f(x[, 1])
  right <- f(x[, 2])
  table(c(left, right), rep(names(x), c(length(left), length(right))))
}))

# $`1114002`
# 
#          Left.Gene.Symbols Right.Gene.Symbols
#   CDC27                  0                  2
#   CR1L                   1                  0
#   ENPP3                  1                  1
#   GIGYF2                 3                  0
# 
# $`1121501`
# 
#         Left.Gene.Symbols
#   ENPP3                 1
#   NTNG1                 2
# 
# $AMLM12001KP
# < table of extent 0 x 0 >
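
The carry-forward step above relies on cumsum() over the non-NA flags, which can be hard to read at first. Here is a minimal sketch of the same idea on a toy vector (the vector `files` and its values are made up for illustration). It indexes the non-NA names directly rather than going through unique(), and it assumes the first element is not NA.

```r
# Toy column: each name appears once, followed by NA rows that belong to it
files <- c("a", NA, NA, "b", NA)

# cumsum() over the non-NA flags gives a running group index: 1 1 1 2 2
idx <- cumsum(!is.na(files))

# Index the non-NA names by that running index to fill the gaps
filled <- files[!is.na(files)][idx]
filled
# [1] "a" "a" "a" "b" "b"
```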

And since each list element is a table, you can work with them as tables:

data.frame(l$`1114002`)

#     Var1               Var2 Freq
# 1  CDC27  Left.Gene.Symbols    0
# 2   CR1L  Left.Gene.Symbols    1
# 3  ENPP3  Left.Gene.Symbols    1
# 4 GIGYF2  Left.Gene.Symbols    3
# 5  CDC27 Right.Gene.Symbols    2
# 6   CR1L Right.Gene.Symbols    0
# 7  ENPP3 Right.Gene.Symbols    1
# 8 GIGYF2 Right.Gene.Symbols    0
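
The question also asks for cumulative counts across all files. One possible sketch, reusing the same split-and-table idea but without grouping by file, is shown below. The helper name `count_by_column` is hypothetical, and the matrix is rebuilt here only so that the snippet runs on its own; it is meant to hold the same values as the question's data.

```r
# Same values as the question's data, rebuilt so this snippet is self-contained
dd2 <- cbind(
  files_name = c("AMLM12001KP", NA, "1114002", NA, NA, NA,
                 NA, NA, "1121501", NA, NA, NA),
  Left.Gene.Symbols = c(NA, NA, NA, NA, "CR1L", "GIGYF2:GIGYF2:GIGYF2:ENPP3",
                        NA, NA, NA, NA, "NTNG1:NTNG1:ENPP3", NA),
  Right.Gene.Symbols = c(NA, NA, NA, NA, "CDC27:CDC27", NA,
                         "ENPP3", NA, NA, NA, NA, NA)
)

# Hypothetical helper: count genes per column, ignoring the file column
count_by_column <- function(m, cols = c("Left.Gene.Symbols", "Right.Gene.Symbols")) {
  pieces <- lapply(cols, function(col) {
    g <- unlist(strsplit(m[, col], ":", fixed = TRUE))
    g[!is.na(g)]                      # drop rows that had no genes
  })
  # Two-way table: one row per gene, one column per source column
  table(gene = unlist(pieces), column = rep(cols, lengths(pieces)))
}

all_files <- count_by_column(dd2)
all_files
```

With this data, ENPP3 is counted twice under Left.Gene.Symbols and once under Right.Gene.Symbols, which agrees with the "All files" block in the expected output.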


Here's another way, in a list-like format:

rl <- readLines(textConnection("
sample     LeftGene    RightGene
file1
           ATT:ATT      ATT
file2      
           TTP:TTG      TTP:TTP
"))

dd <- setNames(read.table(text = rl[grep('file', rl) + 1], stringsAsFactors = FALSE),
               c('LeftGene','RightGene'))
rownames(dd) <- paste0('File', 1:nrow(dd))

setNames(lapply(1:nrow(dd), function(x) {
  sp <- strsplit(unlist(dd[x, ]), ':')
  table(unlist(sp), rep(names(sp), lengths(sp)))
}), rownames(dd))

# $File1
#     
#       LeftGene RightGene
#   ATT        2         1
#
# $File2
#      
#       LeftGene RightGene
#   TTG        1         0
#   TTP        1         2

setNames(lapply(1:nrow(dd), function(x) {
  sp <- strsplit(unlist(dd[x, ]), ':')
  lapply(sp, function(y) data.frame(table(y)))
}), rownames(dd))


# $File1
# $File1$LeftGene
#     y Freq
# 1 ATT    2
# 
# $File1$RightGene
#     y Freq
# 1 ATT    1
# 
# 
# $File2
# $File2$LeftGene
#     y Freq
# 1 TTG    1
# 2 TTP    1
# 
# $File2$RightGene
#     y Freq
# 1 TTP    2
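
The question asks for labels of the form `ATT=3`. The tables above hold the same counts, so a small formatting step may be enough. The sketch below uses a hypothetical helper, `gene_labels`, which is not part of the original answer, and applies it to the LeftGene strings of file1 from the question's sample.

```r
# Hypothetical helper: turn one column's strings into "GENE=count" labels
gene_labels <- function(x) {
  genes <- unlist(strsplit(x, ":", fixed = TRUE))
  tab <- table(genes[!is.na(genes)])
  # An empty table would otherwise paste to a lone "="
  if (length(tab) == 0) return(character(0))
  paste0(names(tab), "=", as.vector(tab))
}

gene_labels(c("ATT:TAA", "ATT:ATT"))
# [1] "ATT=3" "TAA=1"
gene_labels("ATT")
# [1] "ATT=1"
```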
