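Learning to write functions in R
Problem description
I am at the point with R where I would like to start writing my own functions, because I tend to need to do the same things over and over. However, I am struggling to see how I can generalize what I write. Looking at source code has not helped me learn very well, because it often seems that .Internal or .Primitive functions (or other commands I do not know) are used extensively. I would like to simply start by turning my normal copy-pasted solutions into functions; fancier things can come later!
As an example: I do a lot of data formatting that requires doing some operation and then filling in a data frame with zeros for all other combinations that did not have any data (e.g., years that did not have observations and were therefore not originally recorded). I need to do this over and over for different data sets that have different sets of variables, but the idea and implementation are always the same.
My non-function way of solving this has been (for a specific implementation and minimal example):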
    df <- data.frame(County = c(1, 45, 57),
                     Year = c(2002, 2003, 2003),
                     Level = c("Mean", "Mean", "Mean"),
                     Obs = c(1.4, 1.9, 10.2))
    # Create expanded version of data frame
    Counties <- seq(from = 1, to = 77, by = 2)
    Years <- seq(from = 1999, to = 2014, by = 1)
    Levels <- c("Max", "Mean")
    Expansion <- expand.grid(Counties, Years, Levels)
    Expansion[4] <- 0
    colnames(Expansion) <- colnames(df)
    # Merge and order them so that the observed value is on top
    df_full <- merge(Expansion, df, all = TRUE)
    df_full$duplicate <- with(df_full,
                              paste(Year, County, Level))
    df_full <- df_full[order(df_full$Year,
                             df_full$County,
                             df_full$Level,
                             -abs(df_full$Obs)), ]
    # Deduplicate by taking the first that shows up (the observation)
    df_full <- df_full[!duplicated(df_full$duplicate), ]
    df_full$duplicate <- NULL
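I would like to generalize this so that I could somehow put in a data frame (and probably select the columns I need to order by, since that sometimes changes) and then get the expanded version out. My first implementation consisted of a function with too many arguments (the data frame and then all the column names I wanted to order/expand.grid by), and it also did not work: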
    gridExpand <- function(df, col1, col2=NULL, col3=NULL, measure){
      # Started with "Expansion" being a global outside of the function
      # It is identical to the first part of the above code
      ex <- merge(Expansion, df, all = TRUE)
      ex$dupe <- with(ex,
                      paste(col1, col2, col3))
      ex <- ex[order(with(ex,
                          col1, col2, col3, -abs(measure)))]
      ex <- ex[ !duplicated(ex$dupe)]
      ex <- subset(ex, select = -(dupe))
    }
    df_full <- gridExpand(df, Year, County, Level, Obs)
    Error in paste(col1, col2, col3) : object 'Year' not found
I am assuming that this did not work because R has no way to know where 'Year' came from. I could potentially try paste(df, "$Year"), but it would create "df$Year", which obviously will not work. I also never see anyone else do this in their functions, so clearly I am missing how people reference things in functions that work on data frames.
Ideally, I would like to know of some resources that could help with thinking about generalization, or if someone can point me in the right direction to solving this particular problem, I think it might help me see what I am doing wrong. I do not know of a better way to ask for help - I have been trying to read tutorials on writing functions for about 3 months and it is not clicking.
Recommended answer
At a glance, the biggest thing that you can do is to not use non-standard-evaluation shortcuts inside your functions: things like $, subset() and with(). These are functions intended for convenient interactive use, not extensible programmatic use. (See, e.g., the Warning in ?subset, which should probably be added to ?with, fortunes::fortune(312), fortunes::fortune(343).)
fortunes::fortune(312)
The problem here is that the $ notation is a magical shortcut and like any other magic if used incorrectly is likely to do the programmatic equivalent of turning yourself into a toad.
   -- Greg Snow (in response to a user that wanted to access a column whose name is stored in y via x$y rather than x[[y]])
      R-help (February 2012)
fortunes::fortune(343)
Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable consequences. It's best to acquire the [[ and [ habit early.
   -- Peter Ehlers (about the use of $-extraction)
      R-help (March 2013)
When you start writing functions that work on data frames, if you need to reference column names you should pass them in as strings, and then use [ or [[ to get the column based on the string stored in a variable. This is the simplest way to make functions flexible with user-specified column names. For example, here's a simple stupid function that tests if a data frame has a column of the given name:
    does_col_exist_1 = function(df, col) {
      # `$` takes its argument literally, so this looks for a column named "col"
      return(!is.null(df$col))
    }
    does_col_exist_2 = function(df, col) {
      # `[[` evaluates col; for an existing column this matches df[, col]
      return(!is.null(df[[col]]))
    }
These yield:
    does_col_exist_1(mtcars, col = "jhfa")
    # [1] FALSE
    does_col_exist_1(mtcars, col = "mpg")
    # [1] FALSE
    does_col_exist_2(mtcars, col = "jhfa")
    # [1] FALSE
    does_col_exist_2(mtcars, col = "mpg")
    # [1] TRUE
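The first function is wrong because $ doesn't evaluate what comes after it. No matter what value I set col to when I call the function, df$col will look for a column literally named "col". The brackets, however, will evaluate col and see "oh hey, col is set to "mpg", let's look for a column of that name."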
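The same contrast can be seen with with(), which is what produced the error in the question. The following is a minimal sketch, not part of the original answer: make_key_nse and make_key_str are hypothetical helper names chosen here, and the data frame is the small example from the question.

```r
df <- data.frame(County = c(1, 45, 57),
                 Year   = c(2002, 2003, 2003),
                 Level  = c("Mean", "Mean", "Mean"),
                 Obs    = c(1.4, 1.9, 10.2))

# Hypothetical helper using with(): col1 and col2 are not columns of df,
# so R falls back to the function arguments and tries to evaluate the bare
# names the caller passed (Year, County) as ordinary objects.
make_key_nse <- function(df, col1, col2) {
  with(df, paste(col1, col2))
}

# Hypothetical helper using `[`: cols is a character vector of column names.
# unname() keeps a column called "sep" or "collapse" from being read as an
# argument of paste().
make_key_str <- function(df, cols) {
  do.call(paste, unname(as.list(df[cols])))
}

# Capture the error message instead of stopping the script
nse_result <- tryCatch(make_key_nse(df, Year, County),
                       error = function(e) conditionMessage(e))
str_result <- make_key_str(df, c("Year", "County"))

print(nse_result)
print(str_result)
```

Passing a character vector also removes the need for separate col1, col2, col3 arguments, since any number of key columns can be supplied.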
If you want lots more understanding of this issue, I'd recommend the Non-Standard Evaluation section of Hadley Wickham's Advanced R book.
I'm not going to re-write and debug your functions, but if I wanted to, my first step would be to remove all $, with(), and subset(), replacing them with [. There's a pretty good chance that's all you need to do.
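For illustration only, here is one way that replacement might look for the function in the question. This is a sketch under assumptions rather than the answerer's own rewrite: the name grid_expand, the arguments expansion, key_cols and measure, and the choice to pass the grid in explicitly instead of reading a global are all made up for this example.

```r
# Hypothetical rewrite: column names are passed as strings and accessed
# with `[` and `[[` instead of $, with() and subset().
grid_expand <- function(df, expansion, key_cols, measure) {
  # Stack the zero-filled grid and the observations (both share all columns)
  ex <- merge(expansion, df, all = TRUE)
  # Order by the key columns, then by decreasing |measure| so that an
  # observed value sorts ahead of the zero placeholder for the same key
  ord <- do.call(order, c(unname(as.list(ex[key_cols])),
                          list(-abs(ex[[measure]]))))
  ex <- ex[ord, , drop = FALSE]
  # Keep only the first row for each combination of key columns
  ex <- ex[!duplicated(ex[key_cols]), , drop = FALSE]
  rownames(ex) <- NULL
  ex
}

df <- data.frame(County = c(1, 45, 57),
                 Year   = c(2002, 2003, 2003),
                 Level  = c("Mean", "Mean", "Mean"),
                 Obs    = c(1.4, 1.9, 10.2))

# Same grid as in the question, built with named, non-factor columns
expansion <- expand.grid(County = seq(from = 1, to = 77, by = 2),
                         Year   = seq(from = 1999, to = 2014, by = 1),
                         Level  = c("Max", "Mean"),
                         stringsAsFactors = FALSE)
expansion$Obs <- 0

df_full <- grid_expand(df, expansion,
                       key_cols = c("Year", "County", "Level"),
                       measure  = "Obs")
print(df_full[df_full$Obs != 0, ])
```

Because key_cols is a character vector, the same function can be reused for data sets with a different number of key columns without adding more arguments.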