Constructing a function using colnames as variables

Problem description

I'd like to collect the terms found under multiple columns of the annot data.frame. Below are the column names and the first row of a toy dataset for annot.

colnames(annot)
# [1] "HUGO.Name"   "Common.Name" "Gene.Class"  "Cell.Type"   "Annotation" 
annot[1,]
#   HUGO.Name Common.Name                           Gene.Class Cell.Type
# 1      CCL1        CCL1 Immune Response - Cell Type specific       aDC
#                                                            Annotation
# 1 Cell Type specific, Chemokines and receptors, Inflammatory response

So far, I've been typing out each column name by hand, but I'd like to learn how to write a function that loops through all columns of annot (and, more generally, of other data.frames).

Here is my manual approach:

library(stringr)  # provides str_trim()
yA <- unique(str_trim(unlist(strsplit(annot[, "Annotation"], ","))))
yC <- unique(str_trim(unlist(strsplit(annot[, "Cell.Type"], ","))))

yA
#  [1] "Cell Type specific"                  "Chemokines and receptors"           
#  [3] "Inflammatory response"               "Cytokines and receptors"            
#  [5] "Chronic inflammatory response"       "Th2 orientation"                    
#  [7] "T-cell proliferation"                "Defense response to virus"          
#  [9] "B-cell receptor signaling pathway"   "CD molecules"                       
# [11] "Regulation of immune response"       "Adaptive immune response"           
# [13] "Antigen processing and presentation"

How can I construct a function "y" to simplify this process? I've tried the following:

y <- function (i,n) {unique(str_trim(unlist(strsplit(i[, as.name(n)], ","))))}

However, I get an error when I try to use this function.

yA <- y(annot, Annotation)
# Error in .subset(x, j) : invalid subscript type 'symbol'
# Called from: `[.data.frame`(i, , as.name(n))

What I intend is to use the output of yA and yC to build lists, as follows:

# look up the associated HUGO.Name for each term of yA
for (i in yA) {
  eval(call("<-", as.name(i),
            annot[grepl(i, annot[, "Annotation"], fixed = TRUE), "HUGO.Name"]))
}
# make lists
nSannot_list <- mget(yA)

Recommended answer

Let's assume you're starting with something like this as your data.frame:

mydf <- data.frame(
  v1 = c("A, B, B", "A, C,D"), 
  v2 = c("E, F", " G,H , E, I"), 
  v3 = c("J,K,L,M", "N, J, L, M, K"))

mydf
#        v1          v2            v3
# 1 A, B, B        E, F       J,K,L,M
# 2  A, C,D  G,H , E, I N, J, L, M, K

One way to define your function is shown below. I've stuck to base functions, but you can use "stringr" if you prefer.

myFun <- function(instring) {
  if (!is.character(instring)) instring <- as.character(instring)
  unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}

The first line checks whether the input is a character vector and converts it if it is not. In R versions before 4.0.0, data.frame() and read.table() default to stringsAsFactors = TRUE, so string columns often arrive as factors and need that conversion first. The second line does the splitting and trimming. I've added fixed = TRUE so that the comma is matched literally rather than as a regular expression, which is also more efficient.
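To illustrate why the conversion matters, here is a minimal sketch using a made-up factor `f` (not part of the original data). It assumes only base R: `strsplit()` refuses a factor, while the `as.character()` version works.

```r
# Hypothetical toy column stored as a factor (as string columns were by default before R 4.0.0)
f <- factor(c("A, B", "B,C"))

# strsplit() only accepts character vectors, so calling it on a factor fails;
# tryCatch() captures the error message instead of stopping the script
res <- tryCatch(strsplit(f, ",", fixed = TRUE),
                error = function(e) conditionMessage(e))
res

# Converting first makes the split, trim, and de-duplicate steps work
terms <- unique(trimws(unlist(strsplit(as.character(f), ",", fixed = TRUE))))
terms
# [1] "A" "B" "C"
```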

Once you have such a function, you can apply it with apply (for a data.frame or a matrix, either by row or by column) or with lapply (for a list or a data.frame, where it works column by column).

## If `mydf` is a data.frame, and you want to go by columns
lapply(mydf, myFun) 
# $v1
# [1] "A" "B" "C" "D"
# 
# $v2
# [1] "E" "F" "G" "H" "I"
# 
# $v3
# [1] "J" "K" "L" "M" "N"

## `apply` can be used too. Second argument specifies whether by row or column
apply(mydf, 1, myFun)
apply(mydf, 2, myFun)
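
As a rough illustration of the difference between the two margins, the sketch below repeats `mydf` and `myFun` from above so that it runs on its own. The names `by_row` and `by_col` are introduced here only for illustration. By row, the terms are pooled across all columns of that row; by column, the result matches what `lapply` collects per column.

```r
mydf <- data.frame(
  v1 = c("A, B, B", "A, C,D"),
  v2 = c("E, F", " G,H , E, I"),
  v3 = c("J,K,L,M", "N, J, L, M, K"))

myFun <- function(instring) {
  if (!is.character(instring)) instring <- as.character(instring)
  unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}

by_row <- apply(mydf, 1, myFun)  # one element per row, terms pooled across columns
by_col <- apply(mydf, 2, myFun)  # one element per column

# The results have different lengths, so apply() returns lists here
lengths(by_row)  # 8 and 12 unique terms for rows 1 and 2
by_col[["v2"]]
# [1] "E" "F" "G" "H" "I"
```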

If, on the other hand, you are looking to create a function that accepts the input dataset and the (bare, unquoted) column name, you can write your function like this:

myOtherFun <- function(indf, col) {
  col <- deparse(substitute(col))
  unique(trimws(unlist(strsplit(as.character(indf[, col]), ",", fixed = TRUE))))
}

The first line captures the bare column name as a character string so that it can be used in the typical my_data[, "col_wanted"] form.

Here is the function in use:

myOtherFun(mydf, v2)
# [1] "E" "F" "G" "H" "I"
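
The error in the question appears to come from indexing with a symbol, which `[` does not accept; passing the column name as a string avoids it. The sketch below is one possible variant, not the answer's own code: `collect_terms` and `hugo_by_term` are hypothetical names, and the three-row `annot` is made-up toy data. It also builds the term-to-HUGO.Name list directly with `lapply`, without `eval` or `mget`.

```r
# Hypothetical toy data shaped like `annot`; the rows are invented for illustration
annot <- data.frame(
  HUGO.Name  = c("CCL1", "CCL2", "CD19"),
  Annotation = c("Cell Type specific, Chemokines and receptors",
                 "Chemokines and receptors, Inflammatory response",
                 "CD molecules"),
  stringsAsFactors = FALSE)

# Take the column name as a string; `[[` accepts a character name
collect_terms <- function(df, col) {
  unique(trimws(unlist(strsplit(as.character(df[[col]]), ",", fixed = TRUE))))
}

yA <- collect_terms(annot, "Annotation")

# For each term, collect the HUGO names whose annotation contains it
hugo_by_term <- lapply(yA, function(term) {
  annot$HUGO.Name[grepl(term, annot$Annotation, fixed = TRUE)]
})
names(hugo_by_term) <- yA

hugo_by_term[["Chemokines and receptors"]]
# [1] "CCL1" "CCL2"
```

Because the column is passed as a string, the same function can be mapped over several column names, for example with lapply(c("Annotation", "Cell.Type"), function(cn) collect_terms(annot, cn)) on a data.frame that has both columns.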
