Select min or max values within one cell (delimited string)

Question

I have a data frame in which, for each sample, a column can hold several values in a single cell, for example:

Gene       Pvalue1             Pvalue2              Pvalue3                  Beta
Ace    0.0381, ., 0.00357    0.01755, 0.001385    0.0037, NA , 0.039         -0.03,1,15
NOS          NA                  0.02              0.001, 0.00067              0.00009,25,30

I want to apply min() and max() to each gene's data in each column (I have thousands of genes in total), taking the smallest value for the p-value columns but the largest value for columns such as Beta. The output data would look like this:

Gene       Pvalue1             Pvalue2              Pvalue3                  Beta
Ace        0.00357              0.001385             0.0037                   15
NOS          NA                  0.02                0.00067                  30

I'm new to R and not sure whether what I'm asking is possible. If there are multiple values in one cell, are they treated as strings?

Recommended answer

A possible solution using stringr and dplyr:

library(dplyr)
library(stringr)

# Extract every numeric token from each cell, then take the minimum per cell.
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x) min(as.numeric(x),na.rm = TRUE) ) %>%
  unlist()

# Apply getmin to every column except the first (Gene).
df %>%
  mutate_at(names(df)[-1],getmin)

  Gene Pvalue1  Pvalue2 Pvalue3  Beta
1  Ace 0.00357 0.001385 0.00370 -3e-02
2  NOS     Inf 0.020000 0.00067 9e-05

Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In min(as.numeric(x), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

The function getmin uses str_extract_all to extract the numbers:

 str_extract_all(df$Pvalue2,"[0-9\\.-]+")

[[1]]
[1] "0.01755"  "0.001385"

[[2]]
[1] "0.02"

This approach has the advantage of being insensitive to spaces or other stray characters, but the pattern can also match a lone dot. I then loop over this list to take the minimum within each cell, and convert the resulting list into a vector with unlist. as.numeric() converts any lone . that was extracted to NA, which the na.rm argument then drops.
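The extraction step can be traced on a single cell. This is a minimal sketch, assuming a cell value copied from the example data; the variable names (cell, tokens, nums, cell_min) are only illustrative.

```r
library(stringr)

# One cell from the example data, including the stray "." placeholder.
cell <- "0.0381, ., 0.00357"

# Every run of digits, dots, or minus signs is a token; the lone "." matches too.
tokens <- str_extract_all(cell, "[0-9\\.-]+")[[1]]
print(tokens)

# as.numeric() turns "." into NA (with a coercion warning, silenced here).
nums <- suppressWarnings(as.numeric(tokens))

# na.rm = TRUE drops that NA before taking the minimum.
cell_min <- min(nums, na.rm = TRUE)
print(cell_min)
```

One caveat of this pattern: it has no notion of exponent notation, so a value written as 9e-05 in the input would be split into the two tokens "9" and "-05".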

The code df %>% mutate_at(names(df)[-1],getmin) simply applies this function to all columns except the first one. Note that this takes the minimum of every column, including Beta, whereas the question asks for the maximum there.
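In dplyr 1.0.0 and later, the scoped verbs such as mutate_at are superseded by across(). The same column selection might be written as below; this is a sketch, assuming a reduced stand-in for the example data (two p-value columns, no "." or NA cells) so that it runs without warnings.

```r
library(dplyr)
library(stringr)

# Small stand-in for the example data (character columns only).
df <- data.frame(
  Gene    = c("Ace", "NOS"),
  Pvalue2 = c("0.01755,0.001385", "0.02"),
  Pvalue3 = c("0.0037,0.039", "0.001,0.00067"),
  stringsAsFactors = FALSE
)

# Same logic as getmin in the answer: extract tokens, take the minimum per cell.
getmin <- function(col) {
  str_extract_all(col, "[0-9\\.-]+") %>%
    lapply(function(x) min(as.numeric(x), na.rm = TRUE)) %>%
    unlist()
}

# across(-Gene, ...) selects every column except Gene, like names(df)[-1].
res <- df %>% mutate(across(-Gene, getmin))
print(res)
```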

Edit: if you want to avoid Inf values, you can use this slightly modified version:

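# Return NA instead of Inf when a cell contains no usable number.
min2 = function(x) if(all(is.na(x))) NA else min(x,na.rm = TRUE)
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)min2(as.numeric(x)) ) %>%
  unlist()

# Apply getmin to every column except the first (Gene).
df %>%
    mutate_at(names(df)[-1],getmin)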

  Gene Pvalue1  Pvalue2 Pvalue3  Beta
1  Ace 0.00357 0.001385 0.00370 -3e-02
2  NOS      NA 0.020000 0.00067 9e-05
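The question asks for the minimum of the p-value columns but the maximum of Beta, while the code above takes the minimum everywhere. One way to cover both is sketched below. It is not part of the original answer: reduce_cell is a hypothetical helper, the data frame is a reduced stand-in for the example data, and across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(stringr)

# Reduced stand-in for the example data: one p-value column plus Beta.
df <- data.frame(
  Gene    = c("Ace", "NOS"),
  Pvalue1 = c("0.0381,.,0.00357", NA),
  Beta    = c("-0.03,1,15", "0.00009,25,30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reduce each cell with `fun` (min or max);
# returns NA when the cell holds no usable number.
reduce_cell <- function(col, fun) {
  tokens <- str_extract_all(col, "[0-9\\.-]+")
  vapply(tokens, function(x) {
    nums <- suppressWarnings(as.numeric(x))
    if (all(is.na(nums))) NA_real_ else fun(nums, na.rm = TRUE)
  }, numeric(1))
}

res <- df %>%
  mutate(
    across(starts_with("Pvalue"), ~ reduce_cell(.x, min)),  # smallest p-value
    Beta = reduce_cell(Beta, max)                            # largest beta
  )
print(res)
```

Passing the reducer as an argument keeps a single extraction routine for both cases, and vapply with numeric(1) guarantees one number per cell.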


Data:

df <- read.table(text = "
Gene   Pvalue1             Pvalue2             Pvalue3            Beta
Ace    0.0381,.,0.00357    0.01755,0.001385    0.0037,NA,0.039    -0.03,1,15
NOS    NA                  0.02                0.001,0.00067      0.00009,25,30
",header = TRUE)

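On the question of whether multi-value cells are treated as strings: a quick check on the data above suggests they are. The sketch below passes stringsAsFactors = FALSE explicitly, which is an assumption added here; it is the default from R 4.0.0 onward, while older versions would produce factors instead.

```r
df <- read.table(text = "
Gene   Pvalue1             Pvalue2             Pvalue3            Beta
Ace    0.0381,.,0.00357    0.01755,0.001385    0.0037,NA,0.039    -0.03,1,15
NOS    NA                  0.02                0.001,0.00067      0.00009,25,30
", header = TRUE, stringsAsFactors = FALSE)

# A cell with commas cannot be parsed as a number, so each column is character.
col_classes <- sapply(df, class)
print(col_classes)

# Base R alternative to regex extraction: split each cell on the comma.
parts <- strsplit(df$Beta, ",", fixed = TRUE)
print(parts[[1]])
```

A cell that contains only NA, such as Pvalue1 for NOS, is read as a missing value rather than as the string "NA".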