Learning to write functions in R

Views: 99 | Published: 2018/4/17 10:30:23 | Tags: r, function


Problem description

I am at the point with R where I would like to start writing my own functions because I tend to need to do the same things over and over. However, I am struggling to see how I can generalize what I write. Looking at source code has not helped me learn very well because often it seems that .Internal or .Primitive functions (or other commands I do not know) are used extensively. I would like to simply start by turning my normal copy-pasted solutions into functions - fancier things can come later!

As an example: I do a lot of data formatting that requires doing some operation, and then filling in a data frame with zeros for all other combinations that did not have any data (e.g., years that did not have observations and were therefore not originally recorded, etc). I need to do this over and over for different data sets that have different sets of variables, but the idea and implementation are always the same.

My non-function way of solving this has been (for a specific implementation and minimal example):

df <- data.frame(County = c(1, 45, 57),
                 Year = c(2002, 2003, 2003),
                 Level = c("Mean", "Mean", "Mean"),
                 Obs = c(1.4, 1.9, 10.2))

#Create expanded version of data frame
Counties <- seq(from = 1, to = 77, by = 2)
Years <- seq(from = 1999, to = 2014, by = 1)
Levels <- c("Max", "Mean")
Expansion <- expand.grid(Counties, Years, Levels)
Expansion[4] <- 0
colnames(Expansion) <- colnames(df)

#Merge and order them so that the observed value is on top
df_full <- merge(Expansion, df, all = TRUE)
df_full$duplicate <- with(df_full,
                          paste(Year, County, Level))

df_full <- df_full[order(df_full$Year,
                         df_full$County,
                         df_full$Level,
                         -abs(df_full$Obs)), ]

#Deduplicate by taking the first that shows up (the observation)
df_full <- df_full[ !duplicated(df_full$duplicate), ]
df_full$duplicate <- NULL

I would like to generalize this so that I could somehow put in a data frame (and probably select the columns I need to order by, since that sometimes changes) and then get the expanded version out. My first implementation consisted of a function with too many arguments (the data frame and then all the column names I wanted to order/expand.grid by), and it also did not work:

gridExpand <- function(df, col1, col2=NULL, col3=NULL, measure){
  #Started with "Expansion" being a global outside of the function 
  #It is identical to the first part of the above code
  ex <- merge(Expansion, df, all = TRUE)
  ex$dupe <- with(ex,
                 paste(col1, col2, col3))
   ex <- ex[order(with(ex,
                       col1, col2, col3, -abs(measure)))]
   ex <- ex[ !duplicated(ex$dupe)]
   ex <- subset(ex, select = -(dupe))  
}

df_full <- gridExpand(df, Year, County, Level, Obs)

Error in paste(col1, col2, col3) : object 'Year' not found

I am assuming that this did not work because R has no way to know where 'Year' came from. I could potentially try paste(df, "$Year"), but it would create "df$Year", which obviously will not work. And I do not ever see anyone else do this in their functions, so clearly I am missing how people reference things in functions that work on data frames.

I would ideally like to know of some resources that could help with thinking about generalization, or if someone can point me in the right direction to solving this particular problem I think it might help me see what I am doing wrong. I do not know of a better way to ask for help - I have been trying to read tutorials on writing functions for about 3 months and it is not clicking. 

Solution

At a glance, the biggest thing that you can do is to not use non-standard-evaluation shortcuts inside your functions: things like $, subset() and with(). These are functions intended for convenient interactive use, not extensible programmatic use. (See, e.g., the Warning in ?subset, which should probably be added to ?with, as well as fortunes::fortune(312) and fortunes::fortune(343).)

fortunes::fortune(312)



  The problem here is that the $ notation is a magical shortcut and like
  any other magic if used incorrectly is likely to do the programmatic
  equivalent of turning yourself into a toad.    -- Greg Snow (in
  response to a user that wanted to access a column whose name is stored
        in y via x$y rather than x[[y]])
        R-help (February 2012)


fortunes::fortune(343)



  Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R
  newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable
  consequences. It's best to acquire the [[ and [ habit early.
     -- Peter Ehlers (about the use of $-extraction)
           R-help (March 2013)

When you start writing functions that work on data frames, if you need to reference column names you should pass them in as strings, and then use [ or [[ to get the column based on the string stored in a variable. This is the simplest way to make functions flexible with user-specified column names. For example, here's a simple, stupid function that tests if a data frame has a column of the given name:

does_col_exist_1 = function(df, col) {
    return(!is.null(df$col))
}

does_col_exist_2 = function(df, col) {
    # df[[col]] is equivalent to df[, col] when the column exists
    return(!is.null(df[[col]]))
}

These yield:

does_col_exist_1(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_1(mtcars, col = "mpg")
# [1] FALSE

does_col_exist_2(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_2(mtcars, col = "mpg")
# [1] TRUE

The first function is wrong because $ doesn't evaluate what comes after it. No matter what value I set col to when I call the function, df$col will look for a column literally named "col". The brackets, however, will evaluate col and see "oh hey, col is set to "mpg", let's look for a column of that name."
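
The same pitfall shows up with with(), which is what the question's function relies on. As a minimal sketch (the helper names col_mean_with and col_mean_bracket are made up for illustration), passing the column name as a string to with() does not look the column up. Since mtcars has no column called col, with() falls back to the function argument and takes the mean of the string itself:

```r
# Hypothetical helpers, for illustration only.
col_mean_with <- function(df, col) {
  # with() looks for a column literally named `col`; finding none, it uses
  # the function argument, so this is mean("mpg"), not mean of the column.
  with(df, mean(col))
}

col_mean_bracket <- function(df, col) {
  # [[ evaluates `col` and extracts the column whose name it holds.
  mean(df[[col]])
}

# mean() of a character string gives NA (with a warning)
bad <- suppressWarnings(col_mean_with(mtcars, "mpg"))
is.na(bad)

# The bracket version returns the mean of the mpg column
good <- col_mean_bracket(mtcars, "mpg")
good
```

Passing a bare name instead of a string, as in the question's gridExpand(df, Year, ...), fails differently: R tries to evaluate Year as an ordinary variable and reports that the object is not found.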

If you want lots more understanding of this issue, I'd recommend the Non-Standard Evaluation Section of Hadley Wickham's Advanced R book.

I'm not going to re-write and debug your functions, but if I wanted to, my first step would be to remove all $, with(), and subset(), replacing them with [. There's a pretty good chance that's all you need to do.
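
For what it's worth, here is one way that replacement might look. This is a sketch under assumptions, not a tested drop-in. The name grid_expand and its arguments (expansion, keys, measure) are invented for illustration, the expansion grid is passed in explicitly instead of being a global, and the key and measure columns are given as strings. It keeps the question's merge, order, deduplicate logic but uses only [ and [[:

```r
# Hypothetical rewrite of the question's function; names are illustrative.
grid_expand <- function(df, expansion, keys, measure) {
  # Union of the filler rows and the observed rows (same columns in both)
  ex <- merge(expansion, df, all = TRUE)

  # Order by the key columns, then by decreasing |measure| so that an
  # observed value sorts ahead of its zero filler row.
  ord_args <- c(unname(as.list(ex[keys])), list(-abs(ex[[measure]])))
  ex <- ex[do.call(order, ord_args), ]

  # Keep the first row for each combination of key values
  ex <- ex[!duplicated(ex[keys]), ]
  rownames(ex) <- NULL
  ex
}

# Data from the question
df <- data.frame(County = c(1, 45, 57),
                 Year = c(2002, 2003, 2003),
                 Level = c("Mean", "Mean", "Mean"),
                 Obs = c(1.4, 1.9, 10.2),
                 stringsAsFactors = FALSE)

expansion <- expand.grid(County = seq(from = 1, to = 77, by = 2),
                         Year = seq(from = 1999, to = 2014, by = 1),
                         Level = c("Max", "Mean"),
                         stringsAsFactors = FALSE)
expansion[["Obs"]] <- 0

df_full <- grid_expand(df, expansion,
                       keys = c("Year", "County", "Level"),
                       measure = "Obs")
```

Taking the key columns as a character vector also removes the need for separate col1, col2, col3 arguments, since do.call() passes however many columns are named to order().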
