How to use ifelse when comparing two columns and changing a third?
Question
I still find the ifelse structure in R a bit confusing. I've got the following data frame:
df <- structure(list(snp = structure(1:11, .Label = c("AL0009", "AL00014", "AL0021", "AL00046", "AL0047", "AS0005", "AS0014", "AS00021", "AS0047", "AS0071", "DR0001" ), class = "factor"), CHROMOSOME = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), COUNT_ALLELE = structure(c(1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, 1L), .Label = c("A", "C", "G"), class = "factor"), OTHER_ALLELE = structure(c(3L, 3L, 2L, 1L, 3L, 2L, 2L, 1L, 1L, 1L, 3L), .Label = c("A", "C", "G"), class = "factor"), `116601888` = c(0L, 0L, 0L, 2L, 2L, 0L, 0L, 0L, 0L, 0L, 2L ), `116621563` = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L), `117253533` = c(0L, 0L, 0L, 2L, 2L, 0L, 0L, 0L, 1L, 0L, 2L), `117423827` = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 2L)), .Names = c("snp", "CHROMOSOME", "COUNT_ALLELE", "OTHER_ALLELE", "11688", "11663", "11533", "13827" ), row.names = c(NA, 11L), class = "data.frame")
# snp CHROMOSOME COUNT_ALLELE OTHER_ALLELE 11688 11663 11533 13827
# 1 AL0009 1 A G 0 0 0 1
# 2 AL00014 1 A G 0 0 0 1
# 3 AL0021 1 A C 0 0 0 1
# 4 AL00046 1 G A 2 1 2 1
# 5 AL0047 1 A G 2 1 2 1
# 6 AS0005 1 A C 0 0 0 0
# 7 AS0014 1 A C 0 0 0 0
# 8 AS00021 1 C A 0 1 0 0
# 9 AS0047 1 G A 0 0 1 1
# 10 AS0071 1 G A 0 0 0 1
# 11 DR0001 1 A G 2 1 2 2
Using the TranslateAllele function, I want to replace the numbers in the columns from column 5 onward with the corresponding two-letter codes:
TranslateAllele <- function(COUNT_ALLELE, OTHER_ALLELE, genotype) {
  if (genotype == 0) {
    print(paste(OTHER_ALLELE, OTHER_ALLELE, sep = ""))
  } else if (genotype == 1) {
    print(paste(COUNT_ALLELE, OTHER_ALLELE, sep = ""))
  } else if (genotype == 2) {
    print(paste(COUNT_ALLELE, COUNT_ALLELE, sep = ""))
  }
}
So the desired output is as follows:
# snp CHROMOSOME COUNT_ALLELE OTHER_ALLELE 11688 11663 11533 13827
# 1 AL0009 1 A G GG GG GG AG
# 2 AL00014 1 A G GG GG GG AG
# 3 AL0021 1 A C CC CC CC AC
# 4 AL00046 1 G A GG GA GG GA
# 5 AL0047 1 A G AA AG AA AG
# 6 AS0005 1 A C CC CC CC CC
# 7 AS0014 1 A C CC CC CC CC
# 8 AS00021 1 C A AA CA AA AA
# 9 AS0047 1 G A AA AA GA GA
# 10 AS0071 1 G A AA AA AA GA
# 11 DR0001 1 A G AA AG AA AA
Eventually I need to do this for 1.6M rows by 1M columns, so I won't be able to simply use a for loop :(
Answer
I tend to avoid ifelse. It has some serious disadvantages. The following is a compromise between efficiency and simplicity:
df[, 5:8] <- lapply(df[, 5:8], function(x, a, b) {
  x[x == 0] <- paste0(b, b)[x == 0]
  x[x == 1] <- paste0(a, b)[x == 1]
  x[x == 2] <- paste0(a, a)[x == 2]
  x
}, a = df$COUNT_ALLELE, b = df$OTHER_ALLELE)
# snp CHROMOSOME COUNT_ALLELE OTHER_ALLELE 11688 11663 11533 13827
# 1 AL0009 1 A G GG GG GG AG
# 2 AL00014 1 A G GG GG GG AG
# 3 AL0021 1 A C CC CC CC AC
# 4 AL00046 1 G A GG GA GG GA
# 5 AL0047 1 A G AA AG AA AG
# 6 AS0005 1 A C CC CC CC CC
# 7 AS0014 1 A C CC CC CC CC
# 8 AS00021 1 C A AA CA AA AA
# 9 AS0047 1 G A AA AA GA GA
# 10 AS0071 1 G A AA AA AA GA
# 11 DR0001 1 A G AA AG AA AA
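Since the question asks about ifelse specifically, here is a minimal sketch of the same per-column replacement written with nested ifelse(), for comparison only. The data frame toy, its columns s1 and s2, and the helper translate_ifelse are hypothetical names made up for this illustration; the answer above deliberately avoids ifelse.

```r
# Toy data with the same layout as the question (values are made up for illustration).
toy <- data.frame(
  COUNT_ALLELE = c("A", "G", "C"),
  OTHER_ALLELE = c("G", "A", "A"),
  s1 = c(0L, 2L, 1L),
  s2 = c(1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Vectorized translation of one genotype column using nested ifelse().
# Genotype codes other than 0, 1 or 2 become NA.
translate_ifelse <- function(x, a, b) {
  ifelse(x == 0, paste0(b, b),
    ifelse(x == 1, paste0(a, b),
      ifelse(x == 2, paste0(a, a), NA_character_)))
}

# Apply the helper to every genotype column (here columns 3 and 4).
toy[, 3:4] <- lapply(toy[, 3:4], translate_ifelse,
                     a = toy$COUNT_ALLELE, b = toy$OTHER_ALLELE)
print(toy)
```

Unlike the if/else chain in TranslateAllele, which only handles a single genotype value, ifelse() works on a whole column at once, but it still loops over the columns via lapply.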
However, your dataset has many columns. You should therefore reshape your data.frame to long format (provided you have sufficient memory) in order to avoid the loop:
library(reshape2)
# Long format: one row per snp/sample combination
dfmelt <- melt(df, id.vars = c("snp", "CHROMOSOME", "COUNT_ALLELE", "OTHER_ALLELE"))
# Default for genotype 0; use the columns of dfmelt (not df) so the lengths match without recycling
dfmelt$code <- paste0(dfmelt$OTHER_ALLELE, dfmelt$OTHER_ALLELE)
dfmelt[dfmelt$value == 1L,] <- within(dfmelt[dfmelt$value == 1L,], code <- paste0(COUNT_ALLELE, OTHER_ALLELE))
dfmelt[dfmelt$value == 2L,] <- within(dfmelt[dfmelt$value == 2L,], code <- paste0(COUNT_ALLELE, COUNT_ALLELE))
And of course, your data is so large that you would really benefit from using package data.table:
library(data.table)
setDT(df)
dfmelt <- melt(df, id.vars = c("snp", "CHROMOSOME", "COUNT_ALLELE", "OTHER_ALLELE"))
dfmelt[value == 0L, code := paste0(OTHER_ALLELE, OTHER_ALLELE)]
dfmelt[value == 1L, code := paste0(COUNT_ALLELE, OTHER_ALLELE)]
dfmelt[value == 2L, code := paste0(COUNT_ALLELE, COUNT_ALLELE)]
If you must, you can dcast the long-format data.frame/data.table to wide format in the end. But there shouldn't be a reason to do that.
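For completeness, the round trip from long back to wide format might look like the following sketch. It assumes the data.table package is installed; the table toy, the sample columns ind1 and ind2, and the object names long and wide are hypothetical and chosen only for this example.

```r
library(data.table)

# Toy data with the same layout as the question (values are made up for illustration).
toy <- data.table(
  snp = c("S1", "S2", "S3"),
  COUNT_ALLELE = c("A", "G", "C"),
  OTHER_ALLELE = c("G", "A", "A"),
  ind1 = c(0L, 2L, 1L),
  ind2 = c(1L, 1L, 0L)
)

# Long format: one row per snp/sample combination.
long <- melt(toy, id.vars = c("snp", "COUNT_ALLELE", "OTHER_ALLELE"))

# Fill the two-letter code by reference, one genotype value at a time.
long[value == 0L, code := paste0(OTHER_ALLELE, OTHER_ALLELE)]
long[value == 1L, code := paste0(COUNT_ALLELE, OTHER_ALLELE)]
long[value == 2L, code := paste0(COUNT_ALLELE, COUNT_ALLELE)]

# Back to wide format: one column of codes per sample.
wide <- dcast(long, snp + COUNT_ALLELE + OTHER_ALLELE ~ variable,
              value.var = "code")
print(wide)
```

The value.var argument tells dcast to spread the translated code column rather than the original numeric value column.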