How to set multiple conditions for multiple values in a row in R?

Problem description

I have a genetic data set where each row describes a gene. It has a beta column holding multiple beta values that I've compressed into one cell (at the variant level, multiple variants in one gene gave multiple betas). The beta is the effect size that the gene can have on a condition, so large negative values are as important as large positive values. I am trying to write code that selects either the largest negative or the largest positive beta value for a gene, with cutoffs at -0.5 and 0.5.

The rules I am trying to code are these:

If a gene/row has a value less than -0.5 and no values higher than 0.5, keep only the largest negative (most negative) value.

If it has a value higher than 0.5 and no values less than -0.5, keep only the largest positive value.

If it has no values less than -0.5 or more than 0.5, keep the largest value.

If it has values both less than -0.5 and more than 0.5, keep the largest value.
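
Since the second, third and fourth rules above all keep the largest value, the four rules reduce to a single test: keep the minimum only when it is below -0.5 and the maximum does not exceed 0.5. A minimal sketch for the numeric betas of one gene, where `pick_beta` and `cutoff` are names made up here for illustration:

```r
# Hypothetical helper: apply the four rules to the betas of one gene.
pick_beta <- function(x, cutoff = 0.5) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)  # nothing to choose from
  lo <- min(x)
  hi <- max(x)
  # Only the first rule keeps the most negative value;
  # the other three rules all keep the largest value.
  if (lo < -cutoff && hi <= cutoff) lo else hi
}

pick_beta(c(0.01, -0.6, 0.4))   # ACE: -0.6
pick_beta(c(0.8, -0.6, 0.001))  # P53: 0.8
```

Writing the rule once for a plain numeric vector keeps it easy to check by hand before it is applied to a whole column.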

For example my data looks like this:

Gene    Beta(s)
ACE     0.01, -0.6, 0.4
BRCA    0.7, -0.2, 0.2 
ZAP70   0.001, 0.02, -0.003
P53     0.8, -0.6, 0.001

Expected output (selecting largest negative or positive values depending on set conditions):

Gene    Beta(s)
ACE     -0.6  
BRCA     0.7
ZAP70    0.02
P53      0.8   

I am from a biology background and new to R, so not sure how to code this. At the moment I am working with functions to select either the maximum or minimum beta values for a gene, but I don't know how to amend this with further conditions:

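library(stringr)  # str_extract_all()
library(dplyr)    # %>% and mutate_at()

# max/min that return NA instead of -Inf/Inf (with a warning) when every value is NA
max2 = function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)
# extract every number from each string, then take the per-row maximum
getmax = function(col) str_extract_all(col, "[0-9\\.-]+") %>%
  lapply(., function(x) max2(as.numeric(x))) %>%
  unlist()

min2 = function(x) if (all(is.na(x))) NA else min(x, na.rm = TRUE)
getmin = function(col) str_extract_all(col, "[0-9\\.-]+") %>%
  lapply(., function(x) min2(as.numeric(x))) %>%
  unlist()

# replace the second column (the betas) with its per-row maximum
test <- df %>%
  mutate_at(names(df)[2], getmax)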
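
One way to extend the max/min helpers above with further conditions is to compute both the per-row minimum and the per-row maximum, then choose between them with a vectorised `ifelse`. The sketch below is not the asker's code: it swaps `str_extract_all` for base R `regmatches`/`gregexpr` so that it runs without extra packages, and `get_beta` and `cutoff` are illustrative names.

```r
# Illustrative helper: pick one beta per comma-separated string.
get_beta <- function(col, cutoff = 0.5) {
  # Pull every signed decimal number out of each string.
  nums <- regmatches(col, gregexpr("-?[0-9]+\\.?[0-9]*", col))
  nums <- lapply(nums, as.numeric)
  # NA when a string holds no numbers, otherwise its min / max.
  lo <- vapply(nums, function(x) if (length(x)) min(x) else NA_real_, numeric(1))
  hi <- vapply(nums, function(x) if (length(x)) max(x) else NA_real_, numeric(1))
  # Keep the most negative value only when it passes -cutoff and
  # nothing exceeds +cutoff; otherwise keep the largest value.
  ifelse(lo < -cutoff & hi <= cutoff, lo, hi)
}

betas <- c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2",
           "0.001, 0.02, -0.003", "0.8, -0.6, 0.001")
get_beta(betas)
```

With the example data this should return -0.6, 0.7, 0.02 and 0.8, matching the expected output in the question.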

Any help in the right direction of how to set multiple conditional statements would be appreciated.

Example data:

 dput(df)
structure(list(Gene = c("ACE", "BRCA", "ZAP70", "P53"), `Beta(s)` = c("0.01, -0.6, 0.4", 
"0.7, -0.2, 0.2", "0.001, 0.02, -0.003", "0.8, -0.6, 0.001")), row.names = c(NA, 
-4L), class = c("data.table", "data.frame"))

Solution

Here is a data.table solution that should run fast and works independently of the number of betas provided.

library( data.table )
library( matrixStats ) 
#set df as data.table
setDT( df )
#split Beta(s) to columns (dynamically)
df[, paste0( "Beta", 
             1:length( tstrsplit( df$`Beta(s)`, "," ) ) ) := 
     lapply( tstrsplit( `Beta(s)`, "," ), as.numeric ) ][]
#     Gene             Beta(s) Beta1 Beta2  Beta3
# 1:   ACE     0.01, -0.6, 0.4 0.010 -0.60  0.400
# 2:  BRCA      0.7, -0.2, 0.2 0.700 -0.20  0.200
# 3: ZAP70 0.001, 0.02, -0.003 0.001  0.02 -0.003
# 4:   P53    0.8, -0.6, 0.001 0.800 -0.60  0.001


# now, using rowMins and rowMaxs from the matrixStats package (fast),
# get the filtering (and updating) done by reference.

#If a gene/row has a value less than -0.5 and no values higher than 0.5 then keep only the largest negative value.
df[ df[, rowMins( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] < -0.5 &
      df[, rowMaxs( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] <= 0.5,
    Beta.final := rowMins( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ]
#If it has a value higher than 0.5 and no values less than -0.5 keep only the largest positive value.
df[ df[, rowMaxs( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] > 0.5 &
      df[, rowMins( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] >= -0.5,
    Beta.final := rowMaxs( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ]
#If it has no values less than -0.5 or more than 0.5 keep the largest value.
df[ df[, rowMins( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] >= -0.5 &
      df[, rowMaxs( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] <= 0.5,
    Beta.final := rowMaxs( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ]
#If it has both values less than -0.5 and more than 0.5 keep the largest value.
df[ df[, rowMins( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] < -0.5 &
      df[, rowMaxs( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ] > 0.5,
    Beta.final := rowMaxs( as.matrix(.SD), na.rm = TRUE ), .SDcols = patterns("^Beta[0-9]") ]

Output

#final output
df[, .(Gene, `Beta(s)` = Beta.final )][]
#     Gene Beta(s)
# 1:   ACE   -0.60
# 2:  BRCA    0.70
# 3: ZAP70    0.02
# 4:   P53    0.80
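
The split step in the solution copes with genes that have different numbers of betas because `tstrsplit` pads shorter rows with `NA` (its `fill` argument defaults to `NA`), which is why the row-wise calls pass `na.rm = TRUE`. A small sketch with made-up input, assuming the data.table package is installed:

```r
library(data.table)

# Made-up example: the second gene has only two betas.
x <- c("0.1, -0.7, 0.3", "0.9, -0.2")

# One list element per position; missing positions become NA.
parts <- lapply(tstrsplit(x, ","), as.numeric)
str(parts)
```

Here `parts` has three elements, and the second entry of the third element is `NA` because the second gene has no third beta.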
