Learning to write functions in R


Problem description

I am at the point with R where I would like to start writing my own functions because I tend to need to do the same things over and over. However, I am struggling to see how I can generalize what I write. Looking at source code has not helped me learn very well because it often seems that .Internal or .Primitive functions (or other commands I do not know) are used extensively. I would like to simply start by turning my normal copy-pasted solutions into functions; fancier things can come later!

As an example: I do a lot of data formatting that requires doing some operation and then filling in a data frame with zeros for all other combinations that did not have any data (e.g., years that did not have observations and were therefore not originally recorded). I need to do this over and over for different data sets that have different sets of variables, but the idea and implementation are always the same.

My non-function way of solving this has been (for a specific implementation and minimal example):

df <- data.frame(County = c(1, 45, 57),
                 Year = c(2002, 2003, 2003),
                 Level = c("Mean", "Mean", "Mean"),
                 Obs = c(1.4, 1.9, 10.2))

#Create expanded version of data frame
Counties <- seq(from = 1, to = 77, by = 2)
Years <- seq(from = 1999, to = 2014, by = 1)
Levels <- c("Max", "Mean")
Expansion <- expand.grid(Counties, Years, Levels)
Expansion[4] <- 0
colnames(Expansion) <- colnames(df)

#Merge and order them so that the observed value is on top
df_full <- merge(Expansion, df, all = TRUE)
df_full$duplicate <- with(df_full,
                          paste(Year, County, Level))

df_full <- df_full[order(df_full$Year,
                         df_full$County,
                         df_full$Level,
                         -abs(df_full$Obs)), ]

#Deduplicate by taking the first that shows up (the observation)
df_full <- df_full[ !duplicated(df_full$duplicate), ]
df_full$duplicate <- NULL

I would like to generalize this so that I could somehow put in a data frame (and probably select the columns I need to order by since that sometimes changes) and then get the expanded version out. My first implementation consisted of a function with too many arguments (the data-frame and then all the column names I wanted to order/expand.grid by) and it also did not work:

gridExpand <- function(df, col1, col2=NULL, col3=NULL, measure){
  #Started with "Expansion" being a global outside of the function
  #It is identical to the first part of the above code
  ex <- merge(Expansion, df, all = TRUE)
  ex$dupe <- with(ex,
                 paste(col1, col2, col3))
   ex <- ex[order(with(ex,
                       col1, col2, col3, -abs(measure)))]
   ex <- ex[ !duplicated(ex$dupe)]
   ex <- subset(ex, select = -(dupe))  
}

df_full <- gridExpand(df, Year, County, Level, Obs)

Error in paste(col1, col2, col3) : object 'Year' not found

I am assuming that this did not work because R has no way to know where 'Year' came from. I could potentially try paste(df, "$Year") but it would create "df$Year" which obviously will not work. And I do not ever see anyone else do this in their functions so clearly I am missing how it is that people reference things in data frame relevant functions.

I would ideally like to know of some resources that could help with thinking about generalization, or if someone can point me in the right direction to solving this particular problem I think it might help me see what I am doing wrong. I do not know of a better way to ask for help - I have been trying to read tutorials on writing functions for about 3 months and it is not clicking.

Solution

At a glance, the biggest thing that you can do is to not use non-standard-evaluation shortcuts inside your functions: things like $, subset() and with(). These are functions intended for convenient interactive use, not extensible programmatic use. (See, e.g., the Warning in ?subset which should probably be added to ?with, fortunes::fortune(312), fortunes::fortune(343).)

fortunes::fortune(312)

The problem here is that the $ notation is a magical shortcut and like any other magic if used incorrectly is likely to do the programmatic equivalent of turning yourself into a toad. -- Greg Snow (in response to a user that wanted to access a column whose name is stored in y via x$y rather than x[[y]]) R-help (February 2012)

fortunes::fortune(343)

Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable consequences. It's best to acquire the [[ and [ habit early. -- Peter Ehlers (about the use of $-extraction) R-help (March 2013)
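To make the warning concrete, here is a small sketch of the failure the question ran into, next to a string-based version. The function names paste_cols_nse and paste_cols_str are made up for this illustration, and the example assumes no objects named cyl or gear exist in the calling environment.

```r
# Hypothetical function that mimics the question's use of with():
# col1 and col2 are NOT columns of df, so with() falls back to looking
# them up as ordinary arguments, which point at objects that do not exist.
paste_cols_nse <- function(df, col1, col2) {
  with(df, paste(col1, col2))
}

# Same idea with the column names passed as strings and looked up via [[
paste_cols_str <- function(df, col1, col2) {
  paste(df[[col1]], df[[col2]])
}

# Capture the error message instead of stopping the script
result_nse <- tryCatch(paste_cols_nse(mtcars, cyl, gear),
                       error = function(e) conditionMessage(e))
result_str <- paste_cols_str(mtcars, "cyl", "gear")

print(result_nse)
head(result_str)
```

The first call fails for the same reason as gridExpand(df, Year, County, Level, Obs) in the question: inside the function, with() looks for columns named col1 and col2, not for the columns the caller had in mind.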

When you start writing functions that work on data frames, if you need to reference column names you should pass them in as strings, and then use [ or [[ to get the column based on the string stored in a variable name. This is the simplest way to make functions flexible with user-specified column names. For example, here's a simple stupid function that tests if a data frame has a column of the given name:

does_col_exist_1 = function(df, col) {
    return(!is.null(df$col))
}

does_col_exist_2 = function(df, col) {
    return(!is.null(df[[col]])
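    # for an existing column of a plain data frame this gives the same
    # vector as df[, col]; for a missing column df[[col]] returns NULL,
    # whereas df[, col] throws an error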
}

These yield:

does_col_exist_1(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_1(mtcars, col = "mpg")
# [1] FALSE

does_col_exist_2(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_2(mtcars, col = "mpg")
# [1] TRUE

The first function is wrong because $ doesn't evaluate what comes after it. No matter what value I set col to when I call the function, df$col will look for a column literally named "col" (or, through partial matching, one whose name starts with "col"). The brackets, however, will evaluate col and see "oh hey, col is set to "mpg", let's look for a column of that name."
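The same bracket idea extends to several columns at once, which is what the question needs for ordering. Below is a minimal sketch; sort_by_cols is a made-up helper name, not a base R function, and it assumes cols is a character vector of existing column names.

```r
# Hypothetical helper: sort a data frame by the columns named in `cols`
sort_by_cols <- function(df, cols) {
  # df[cols] selects the requested columns by name; as.list() turns them
  # into a list so do.call() can hand each one to order() as a sort key.
  # unname() stops a column name from being mistaken for an order() argument.
  ord <- do.call(order, unname(as.list(df[cols])))
  df[ord, , drop = FALSE]
}

sorted <- sort_by_cols(mtcars, c("cyl", "mpg"))
head(sorted[, c("cyl", "mpg")])
```

Because the sort keys are just strings, the caller can change which columns to order by without touching the function body.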

If you want lots more understanding of this issue, I'd recommend the Non-Standard Evaluation Section of Hadley Wickham's Advanced R book.

I'm not going to re-write and debug your functions, but if I wanted to, my first step would be to remove all uses of $, with(), and subset(), replacing them with [ or [[. There's a pretty good chance that's all you need to do.
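For illustration only, here is one way that first step might look when applied to the question's script. This is a sketch, not a tested replacement: the name grid_expand, the key_values argument (a named list holding every value each key column can take), and passing the measure column as a string are all assumptions introduced here. Like the original script, it relies on observed values having a larger absolute value than the zero filler.

```r
# Hypothetical rewrite of the question's steps using string column names
grid_expand <- function(df, key_values, measure) {
  keys <- names(key_values)

  # Every combination of the key values, with the measure set to zero
  expansion <- expand.grid(key_values, stringsAsFactors = FALSE)
  expansion[[measure]] <- 0

  # Union of the zero-filled grid and the observed rows
  ex <- merge(expansion, df, all = TRUE)

  # Sort by the keys, then by descending absolute measure so that an
  # observed row comes before its zero-filled twin
  sort_keys <- c(unname(as.list(ex[keys])), list(-abs(ex[[measure]])))
  ex <- ex[do.call(order, sort_keys), , drop = FALSE]

  # Keep the first row of each key combination (the observation, if any)
  ex <- ex[!duplicated(ex[keys]), , drop = FALSE]
  rownames(ex) <- NULL
  ex
}

df <- data.frame(County = c(1, 45, 57),
                 Year = c(2002, 2003, 2003),
                 Level = c("Mean", "Mean", "Mean"),
                 Obs = c(1.4, 1.9, 10.2))

df_full <- grid_expand(
  df,
  key_values = list(County = seq(from = 1, to = 77, by = 2),
                    Year = seq(from = 1999, to = 2014, by = 1),
                    Level = c("Max", "Mean")),
  measure = "Obs"
)
```

In this sketch the rows are sorted in the order the keys appear in key_values, so listing Year before County would reproduce the ordering used in the question. Passing the key values in explicitly also removes the dependence on a global Expansion object.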
