Calculate the frequency of strings separated by a delimiter in different sections of a TSV file
Question
I have a data frame mydf in which the left and right genes are separated by ':'. I need to count the occurrences of these genes in the LeftGene and RightGene columns per file and get something like the result shown after the sample. What would be the best way to do this in R?
sample LeftGene RightGene
file1
ATT:TAA
ATT:ATT ATT
file2
TTP:TTG TTP:TTP
Result
file1
LeftGene RightGene
ATT=3 ATT=1
TAA=1
file2
LeftGene RightGene
TTP=1 TTP=2
TTG=1
To akrun:
Here is the dput of the actual data, which has a files_name column. I need the frequency of the genes in Left.Gene.Symbols and Right.Gene.Symbols for each file. I would also love to see the frequency of these genes across all files (cumulative). Thank you for your help.
mydf<-structure(c("AMLM12001KP", NA, "1114002", NA, NA, NA, NA, NA,
"1121501", NA, NA, NA, "NA", "NA", "NA", "NA", "CR1L", "GIGYF2:GIGYF2:GIGYF2:ENPP3",
"NA", "NA", "NA", "NA", "NTNG1:NTNG1:ENPP3", "NA", "NA", "NA",
"NA", "NA", "CDC27:CDC27", "NA", "ENPP3", "NA", "NA", "NA", "NA",
"NA"), .Dim = c(12L, 3L), .Dimnames = list(NULL, c("files_name",
"Left.Gene.Symbols", "Right.Gene.Symbols")))
Expected output:
AMLM12001KP
Left.Gene.Symbols Right.Gene.Symbols
1114002
Left.Gene.Symbols Right.Gene.Symbols
CR1L=1 CDC27=2
GIGYF2=3 ENPP3=1
ENPP3=1
1121501
Left.Gene.Symbols Right.Gene.Symbols
NTNG1=2
ENPP3=1
All files
Left.Gene.Symbols Right.Gene.Symbols
CR1L=1 CDC27=2
GIGYF2=3 ENPP3=1
NTNG1=2
ENPP3=2
Recommended answer
Edit
dd2<-structure(c("AMLM12001KP", NA, "1114002", NA, NA, NA, NA, NA,"1121501", NA, NA, NA, "NA", "NA", "NA", "NA", "CR1L", "GIGYF2:GIGYF2:GIGYF2:ENPP3","NA", "NA", "NA", "NA", "NTNG1:NTNG1:ENPP3", "NA", "NA", "NA","NA", "NA", "CDC27:CDC27", "NA", "ENPP3", "NA", "NA", "NA", "NA", "NA"), .Dim = c(12L, 3L), .Dimnames = list(NULL, c("files_name", "Left.Gene.Symbols", "Right.Gene.Symbols")))
## change character NAs to <NA> and carry-forward the file column
dd2[dd2 == 'NA'] <- NA
dd2[, 1] <- na.omit(unique(dd2[, 1]))[cumsum(!is.na(dd2[, 1]))]
## split based on file name
sp <- split(data.frame(dd2, stringsAsFactors = FALSE), dd2[, 1])
## split each string by `:` and make a table
(l <- lapply(sp, function(x) {
  x <- droplevels(x[, -1])
  f <- function(x) na.omit(unlist(strsplit(x, ':')))
  left <- f(x[, 1])
  right <- f(x[, 2])
  table(c(left, right), rep(names(x), c(length(left), length(right))))
}))
# $`1114002`
#
# Left.Gene.Symbols Right.Gene.Symbols
# CDC27 0 2
# CR1L 1 0
# ENPP3 1 1
# GIGYF2 3 0
#
# $`1121501`
#
# Left.Gene.Symbols
# ENPP3 1
# NTNG1 2
#
# $AMLM12001KP
# < table of extent 0 x 0 >
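The carry-forward of the file column is the least obvious step above, so here is a minimal sketch of the same idea in isolation. The vector `files` and the names `seen`, `labels`, and `filled` are made up for illustration, and the sketch assumes the first element is not NA. It indexes the non-missing names directly rather than going through `unique()`, which is a small variation on the code above.

```r
## Illustrative input: a file name followed by NAs for the rows that belong to it
files <- c("AMLM12001KP", NA, "1114002", NA, NA, "1121501", NA)

seen   <- !is.na(files)         # TRUE on every row where a new file name starts
labels <- files[seen]           # the file names, in order of appearance
filled <- labels[cumsum(seen)]  # running count 1,1,2,2,2,3,3 selects the current name

filled
```

Because `cumsum(seen)` only increases when a new name appears, every NA row inherits the most recent file name above it.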
Since each list element is a table, you can work with them as tables:
data.frame(l$`1114002`)
# Var1 Var2 Freq
# 1 CDC27 Left.Gene.Symbols 0
# 2 CR1L Left.Gene.Symbols 1
# 3 ENPP3 Left.Gene.Symbols 1
# 4 GIGYF2 Left.Gene.Symbols 3
# 5 CDC27 Right.Gene.Symbols 2
# 6 CR1L Right.Gene.Symbols 0
# 7 ENPP3 Right.Gene.Symbols 1
# 8 GIGYF2 Right.Gene.Symbols 0
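The question also asks for cumulative counts across all files, which the code above does not produce. One possible sketch is to apply the same split-and-tabulate idea to the whole columns without splitting by file first. The matrix `genes` below is a reduced copy that holds only the non-empty gene strings from the dput, and `split_genes` is a hypothetical helper name, not part of the original answer.

```r
## Reduced copy of the two gene columns (values taken from the question's dput;
## the character "NA" entries are already real NA here)
genes <- cbind(
  Left.Gene.Symbols  = c(NA, "CR1L", "GIGYF2:GIGYF2:GIGYF2:ENPP3", "NTNG1:NTNG1:ENPP3"),
  Right.Gene.Symbols = c(NA, "CDC27:CDC27", "ENPP3", NA)
)

## Hypothetical helper: drop NAs, then split every remaining string on ':'
split_genes <- function(x) {
  x <- x[!is.na(x)]
  unlist(strsplit(x, ":", fixed = TRUE))
}

left  <- split_genes(genes[, "Left.Gene.Symbols"])
right <- split_genes(genes[, "Right.Gene.Symbols"])

## Cross-tabulate gene against the column it came from, over all files at once
all_files <- table(
  gene   = c(left, right),
  column = rep(colnames(genes), c(length(left), length(right)))
)

all_files
```

On this data the counts agree with the "All files" block in the question, for example ENPP3 appears twice on the left and once on the right.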
Here's another way, in a list-based format:
rl <- readLines(textConnection("
sample LeftGene RightGene
file1
ATT:ATT ATT
file2
TTP:TTG TTP:TTP
"))
dd <- setNames(read.table(text = rl[grep('file', rl) + 1], stringsAsFactors = FALSE),
               c('LeftGene','RightGene'))
rownames(dd) <- paste0('File', 1:nrow(dd))
setNames(lapply(1:nrow(dd), function(x) {
  sp <- strsplit(unlist(dd[x, ]), ':')
  table(unlist(sp), rep(names(sp), lengths(sp)))
}), rownames(dd))
# $File1
#
# LeftGene RightGene
# ATT 2 1
#
# $File2
#
# LeftGene RightGene
# TTG 1 0
# TTP 1 2
Or
setNames(lapply(1:nrow(dd), function(x) {
  sp <- strsplit(unlist(dd[x, ]), ':')
  lapply(sp, function(y) data.frame(table(y)))
}), rownames(dd))
# $File1
# $File1$LeftGene
# y Freq
# 1 ATT 2
#
# $File1$RightGene
# y Freq
# 1 ATT 1
#
#
# $File2
# $File2$LeftGene
# y Freq
# 1 TTG 1
# 2 TTP 1
#
# $File2$RightGene
# y Freq
# 1 TTP 2
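If the goal is the `GENE=count` layout shown in the question rather than a table or data frame, the counts can be pasted into labels. This is only a sketch: `fmt_counts` is a hypothetical helper name, and the two input vectors are typed in by hand from the question's file1 sample.

```r
## Hypothetical helper: one column of ':'-separated genes -> "GENE=count" labels
fmt_counts <- function(x) {
  x <- as.character(x)
  genes <- unlist(strsplit(x[!is.na(x)], ":", fixed = TRUE))
  if (length(genes) == 0) return(character(0))  # nothing to count in this column
  tab <- table(genes)
  paste0(names(tab), "=", as.vector(tab))
}

## file1 from the question's sample, entered by hand
left_file1  <- c("ATT:TAA", "ATT:ATT")
right_file1 <- c("ATT", NA)

fmt_counts(left_file1)   # "ATT=3" "TAA=1"
fmt_counts(right_file1)  # "ATT=1"
```

The helper could be applied to each column of each per-file piece produced by `split()`, giving one character vector of labels per column.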