Nested if else statements over a number of columns
Question
I have a large data.frame where the first three columns contain information about a marker. The remaining columns are numeric values for that marker in each individual. Each individual has three columns. The dataset looks as follows:
                      marker alleleA alleleB   X818 X818.1 X818.2   X345 X345.1 X345.2   X346 X346.1 X346.2
1   kgp5209280_chr3_21902067       T       A 0.0000 1.0000 0.0000 1.0000 0.0000 0.0000 0.0000 1.0000 0.0000
2 chr3_21902130_21902131_A_T       A       T 0.8626 0.1356 0.0018 0.7676 0.2170 0.0154 0.8626 0.1356 0.0018
3 chr3_21902134_21902135_T_C       T       C 0.6982 0.2854 0.0164 0.5617 0.3749 0.0634 0.6982 0.2854 0.0164
That is, for each marker (row), each individual has three values, one in each column.
I want to create a new data.frame which has all the same rows as the original, but only one column per individual. In that one column I want whichever of the individual's three values is greater than 0.8. If no value is greater than 0.8, I want NA. For instance, in the first row of the data set above I would want the second value for 818 (1.0000) and the first value for 345 (1.0000). In the second row, I want the first value for 818 (0.8626), and for 345 none of the values are above 0.8, so I want NA, and so on. The new data set would therefore look like this:
                      marker alleleA alleleB   X818 X345
1   kgp5209280_chr3_21902067       T       A 1.0000    1
2 chr3_21902130_21902131_A_T       A       T 0.8626   NA
I have been trying to use if/else statements, along the lines of if [, 4] > 0.8 then [, 4], else..., but it doesn't seem to give me what I want. I would also have to loop this command so that it runs not just for one individual's three columns but for all columns.
Any help would be appreciated! Thanks in advance.
Recommended answer
Edit: Updated solution using the fast melt/dcast methods implemented in data.table versions >= 1.9.0. Go here for more info.
require(data.table)
require(reshape2)
dt <- as.data.table(df)
# melt data.table to long format: one row per marker and value column
dt.m <- melt(dt, id.vars = c("marker", "alleleA", "alleleB"),
             variable.name = "id", value.name = "val")
dt.m[, id := gsub("\\.[0-9]+$", "", id)] # strip the trailing `.1`, `.2` so each individual has one id
# aggregation: max value per marker and individual, NA unless it exceeds 0.8
dt.m <- dt.m[, list(alleleA = alleleA[1],
                    alleleB = alleleB[1], val = max(val)),
             keyby = list(marker, id)][val <= 0.8, val := NA]
# casting back to wide format, one column per individual
dt.c <- dcast.data.table(dt.m, marker + alleleA + alleleB ~ id, value.var = "val")
#                        marker alleleA alleleB X345   X346   X818
# 1: chr3_21902130_21902131_A_T       A       T   NA 0.8626 0.8626
# 2: chr3_21902134_21902135_T_C       T       C   NA     NA     NA
# 3:   kgp5209280_chr3_21902067       T       A    1 1.0000 1.0000
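To try the code above on the sample from the question, the three rows can be typed in by hand. This is only a convenience sketch: the name `df` is assumed because the answer's code refers to it, and `stringsAsFactors = FALSE` is a choice made here, not something the question specifies.

```r
# Sample data from the question, entered by hand (assumed to be stored as `df`)
df <- data.frame(
  marker  = c("kgp5209280_chr3_21902067",
              "chr3_21902130_21902131_A_T",
              "chr3_21902134_21902135_T_C"),
  alleleA = c("T", "A", "T"),
  alleleB = c("A", "T", "C"),
  X818    = c(0.0000, 0.8626, 0.6982),
  X818.1  = c(1.0000, 0.1356, 0.2854),
  X818.2  = c(0.0000, 0.0018, 0.0164),
  X345    = c(1.0000, 0.7676, 0.5617),
  X345.1  = c(0.0000, 0.2170, 0.3749),
  X345.2  = c(0.0000, 0.0154, 0.0634),
  X346    = c(0.0000, 0.8626, 0.6982),
  X346.1  = c(1.0000, 0.1356, 0.2854),
  X346.2  = c(0.0000, 0.0018, 0.0164),
  stringsAsFactors = FALSE
)
str(df)  # 3 rows, 12 columns: 3 marker columns plus 3 value columns per individual
```

In this sample each individual's three values sum to 1, so at most one of them can exceed 0.8. That is why taking the maximum and then blanking anything not above 0.8 gives the value the question asks for.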
Solution 1: Probably not the best way, but this is what I could think of at the moment:
# for each row, take the max within each consecutive block of 3 columns
mm <- t(apply(df[-(1:3)], 1, function(x) tapply(x, gl(3,3), max)))
mode(mm) <- "numeric"
mm[mm <= 0.8] <- NA # keep only values strictly greater than 0.8
# you can set the column names of mm here if necessary
out <- cbind(df[, 1:3], mm)
#                       marker alleleA alleleB      1  2      3
# 1   kgp5209280_chr3_21902067       T       A 1.0000  1 1.0000
# 2 chr3_21902130_21902131_A_T       A       T 0.8626 NA 0.8626
# 3 chr3_21902134_21902135_T_C       T       C     NA NA     NA
gl(3,3) gives a factor with values 1,1,1,2,2,2,3,3,3 with levels 1,2,3. That is, tapply will take the values of x 3 at a time and get their max (first 3, next 3 and the last 3). And apply sends each row one by one.
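The grouping described above can be seen on a single row. The sketch below uses the second row of the sample data as a plain named vector. The last part, which derives the group labels from the column names instead of hard-coding `gl(3,3)`, is a hypothetical generalisation and not part of the original answer; it assumes every individual has exactly three adjacent columns.

```r
# One row of numeric values: three individuals, three columns each
x <- c(X818 = 0.8626, X818.1 = 0.1356, X818.2 = 0.0018,
       X345 = 0.7676, X345.1 = 0.2170, X345.2 = 0.0154,
       X346 = 0.8626, X346.1 = 0.1356, X346.2 = 0.0018)

# gl(3, 3) labels positions 1-3 as group 1, 4-6 as group 2, 7-9 as group 3
g <- gl(3, 3)
print(g)

# tapply applies max within each group
m <- tapply(x, g, max)
print(m)

# Hypothetical generalisation: take the number of individuals and their
# labels from the names, assuming three adjacent columns per individual
ids <- unique(sub("\\.[0-9]+$", "", names(x)))
g2 <- gl(length(x) / 3, 3, labels = ids)
m2 <- tapply(x, g2, max)
m2[m2 <= 0.8] <- NA   # keep only values strictly greater than 0.8
print(m2)
```

Using the labels this way also gives the result columns the individuals' names rather than 1, 2, 3.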
Solution 2: A data.table solution with melt and cast within data.table, without using reshape or reshape2:
require(data.table)
dt <- data.table(df)
# melt your data.table to long format
dt.melt <- dt[, list(id = names(.SD), val = unlist(.SD)),
              by = list(marker, alleleA, alleleB)]
# strip the trailing `.1`, `.2` so each individual has one id
dt.melt[, id := gsub("\\.[0-9]+$", "", id)]
# get max value grouping by marker and id; NA unless it exceeds 0.8
dt.melt <- dt.melt[, list(alleleA = alleleA[1],
                          alleleB = alleleB[1],
                          val = max(val)),
                   keyby = list(marker, id)][val <= 0.8, val := NA]
# edit by mnel: use setattr(, 'names') to avoid the copy made by `names<-` within `setNames`
dt.cast <- dt.melt[, as.list(setattr(val, 'names', id)),
                   by = list(marker, alleleA, alleleB)]
#                        marker alleleA alleleB X345   X346   X818
# 1: chr3_21902130_21902131_A_T       A       T   NA 0.8626 0.8626
# 2: chr3_21902134_21902135_T_C       T       C   NA     NA     NA
# 3:   kgp5209280_chr3_21902067       T       A    1 1.0000 1.0000
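As a cross-check on the outputs above, the same result can be sketched in base R for any number of individuals. This is a different technique from the answer's solutions: it groups columns by their name prefix with `split.default` and takes row-wise maxima with `pmax`. The small data frame is a hypothetical stand-in with made-up marker names, and the code assumes each individual's columns share a prefix followed by an optional `.1`, `.2` suffix.

```r
# Small stand-in for the question's data (hypothetical: two markers, two individuals)
df <- data.frame(
  marker  = c("m1", "m2"),
  alleleA = c("T", "A"),
  alleleB = c("A", "T"),
  X818 = c(0.0000, 0.8626), X818.1 = c(1.0000, 0.1356), X818.2 = c(0.0000, 0.0018),
  X345 = c(1.0000, 0.7676), X345.1 = c(0.0000, 0.2170), X345.2 = c(0.0000, 0.0154),
  stringsAsFactors = FALSE
)

vals <- df[-(1:3)]
# individual label = column name without the ".1"/".2" suffix
id <- sub("\\.[0-9]+$", "", names(vals))
id <- factor(id, levels = unique(id))   # keep the original column order

# split.default splits the columns of the data.frame by the factor;
# pmax then takes the row-wise maximum within each individual's columns
best <- sapply(split.default(vals, id), function(d) do.call(pmax, d))
best[best <= 0.8] <- NA                 # keep only values strictly greater than 0.8

out <- cbind(df[1:3], best)
print(out)
```

Because the columns are grouped by name, this version keeps the individuals in their original order and labels the result columns X818, X345 and so on. Note that `sapply` only returns a matrix here when the data has more than one row.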