Constructing a function using colnames as variables
Question
I'd like to collect the terms found under multiple columns of the annot data.frame. Below are the column names and the first row of a toy dataset for annot.
colnames(annot)
# [1] "HUGO.Name" "Common.Name" "Gene.Class" "Cell.Type" "Annotation"
annot[1,]
# HUGO.Name Common.Name Gene.Class Cell.Type
# 1 CCL1 CCL1 Immune Response - Cell Type specific aDC
# Annotation
# 1 Cell Type specific, Chemokines and receptors, Inflammatory response
So far, I've been writing out each column name by hand, but I'd like to learn how to write a function that loops through all columns of annot (and, more generally, of other data.frames).
Here is my manual approach (str_trim() comes from the stringr package):
yA <- unique(str_trim(unlist(strsplit(annot[, "Annotation"], ","))))
yC <- unique(str_trim(unlist(strsplit(annot[, "Cell.Type"], ","))))
yA
# [1] "Cell Type specific" "Chemokines and receptors"
# [3] "Inflammatory response" "Cytokines and receptors"
# [5] "Chronic inflammatory response" "Th2 orientation"
# [7] "T-cell proliferation" "Defense response to virus"
# [9] "B-cell receptor signaling pathway" "CD molecules"
# [11] "Regulation of immune response" "Adaptive immune response"
# [13] "Antigen processing and presentation"
How can I construct a function "y" to simplify this process? I've tried the following:
y <- function (i,n) {unique(str_trim(unlist(strsplit(i[, as.name(n)], ","))))}
However, I get an error when I try to use this function.
yA <- y(annot, Annotation)
# Error in .subset(x, j) : invalid subscript type 'symbol'
# Called from: `[.data.frame`(i, , as.name(n))
What I intend is to use the output of yA and yC to make lists as follows:
# look up associated HUGO.Name per each term of yA
for (i in yA) {
  eval(call("<-", as.name(i),
            annot[grepl(i, annot[, "Annotation"], fixed = TRUE), "HUGO.Name"]))
}
# make lists
nSannot_list<- mget(yA)
Recommended answer
Let's assume you're starting with something like this as your data.frame:
mydf <- data.frame(
v1 = c("A, B, B", "A, C,D"),
v2 = c("E, F", " G,H , E, I"),
v3 = c("J,K,L,M", "N, J, L, M, K"))
mydf
# v1 v2 v3
# 1 A, B, B E, F J,K,L,M
# 2 A, C,D G,H , E, I N, J, L, M, K
One way to define your function is shown below. I've stuck to base functions, but you can use "stringr" if you prefer.
myFun <- function(instring) {
  if (!is.character(instring)) instring <- as.character(instring)
  unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}
The first line checks whether the input is a character vector and converts it if not. In R versions before 4.0.0, data.frame() and read.csv() convert strings to factors by default (stringsAsFactors = TRUE), so the conversion back to character has to happen first. The second line does the splitting and trimming. I've added fixed = TRUE so that the comma is matched literally instead of as a regular expression, which is more efficient.
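To see why the as.character() guard matters, here is a minimal sketch. The vector f is made up for illustration and stands in for a column that was read in as a factor. strsplit() only accepts character input, so the factor has to be converted first.

```r
# Illustrative factor, standing in for a column read with stringsAsFactors = TRUE
f <- factor(c("A, B, B", "A, C,D"))

# strsplit() rejects non-character input, so this call fails;
# try() captures the error instead of stopping the script
res <- try(strsplit(f, ",", fixed = TRUE), silent = TRUE)
inherits(res, "try-error")

# Converting to character first gives the expected terms
terms <- unique(trimws(unlist(strsplit(as.character(f), ",", fixed = TRUE))))
terms
```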
Once you have such a function, you can easily apply it using apply (for a data.frame or a matrix, either by row or by column) or using lapply (for a list or a data.frame, in which case it works by column).
## If `mydf` is a data.frame, and you want to go by columns
lapply(mydf, myFun)
# $v1
# [1] "A" "B" "C" "D"
#
# $v2
# [1] "E" "F" "G" "H" "I"
#
# $v3
# [1] "J" "K" "L" "M" "N"
## `apply` can be used too. Second argument specifies whether by row or column
apply(mydf, 1, myFun)
apply(mydf, 2, myFun)
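As a rough illustration of the difference between the two apply calls, the sketch below repeats mydf and myFun so that it runs on its own. The names by_row and by_col are introduced here only for the example. apply() coerces the data.frame to a matrix before calling the function, and because the results differ in length in this toy data, both calls return a list rather than a matrix.

```r
# Same toy data and function as above, repeated so the block is self-contained
mydf <- data.frame(
  v1 = c("A, B, B", "A, C,D"),
  v2 = c("E, F", " G,H , E, I"),
  v3 = c("J,K,L,M", "N, J, L, M, K"),
  stringsAsFactors = FALSE)

myFun <- function(instring) {
  if (!is.character(instring)) instring <- as.character(instring)
  unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}

# MARGIN = 1: one result per row, pooling the terms from all three columns
by_row <- apply(mydf, 1, myFun)

# MARGIN = 2: one result per column, named after the columns
by_col <- apply(mydf, 2, myFun)

lengths(by_row)  # number of unique terms in each row
names(by_col)    # column names carried over to the result
```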
If, on the other hand, you are looking to create a function that accepts the input dataset and the (bare, unquoted) column name, you can write your function like this:
myOtherFun <- function(indf, col) {
  col <- deparse(substitute(col))
  unique(trimws(unlist(strsplit(as.character(indf[, col]), ",", fixed = TRUE))))
}
The first line captures the bare column name as a character string so that it can be used in the typical my_data[, "col_wanted"] form.
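The capture step can be looked at in isolation. In the sketch below, show_col is a hypothetical helper written only for this illustration. It returns the string that deparse(substitute(col)) produces, without ever evaluating the argument. The last lines show, for contrast, that a symbol created with as.name() is not a valid subscript, which is what the attempt in the question ran into.

```r
# Hypothetical helper: returns the unevaluated argument as a string
show_col <- function(col) deparse(substitute(col))

captured <- show_col(Cell.Type)  # no object named Cell.Type needs to exist
captured
is.character(captured)

# A symbol is not a string, and `[` does not accept it as a column index
toy <- data.frame(v2 = c("E, F", "G, H"), stringsAsFactors = FALSE)
sym_attempt <- try(toy[, as.name("v2")], silent = TRUE)
inherits(sym_attempt, "try-error")
```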
Here is myOtherFun in use:
myOtherFun(mydf, v2)
# [1] "E" "F" "G" "H" "I"
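For the goal stated in the question (looking up the HUGO.Name values for each term), one possible approach is to build a named list directly with lapply, without eval() and mget(). This is only a sketch under assumptions: the three-row annot below uses invented gene names and is not the asker's data, and annot_list is a name chosen for the example.

```r
# Invented toy data with the two relevant column names from the question
annot <- data.frame(
  HUGO.Name  = c("GENE1", "GENE2", "GENE3"),
  Annotation = c("Chemokines and receptors, Inflammatory response",
                 "Inflammatory response",
                 "CD molecules, Chemokines and receptors"),
  stringsAsFactors = FALSE)

myFun <- function(instring) {
  if (!is.character(instring)) instring <- as.character(instring)
  unique(trimws(unlist(strsplit(instring, ",", fixed = TRUE))))
}

# Unique terms in the Annotation column
yA <- myFun(annot$Annotation)

# One list element per term, holding the names of the rows that mention it
annot_list <- setNames(
  lapply(yA, function(term) {
    annot$HUGO.Name[grepl(term, annot$Annotation, fixed = TRUE)]
  }),
  yA)

annot_list[["Inflammatory response"]]
```

Note that grepl(..., fixed = TRUE) matches substrings, so a term that is contained in a longer term would also pick up the rows of the longer one. If that matters for your data, compare against the split terms of each row instead.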