Closures as solution to data merging idiom

Question

I'm trying to wrap my head around closures, and I think I've found a case where they might be helpful.

I have the following pieces to work with:

  • A set of regular expressions designed to clean state names, housed in a function
  • A data.frame with state names (of the standardized form that the function above creates) and state ID codes, to link the two (the "merge map")

The idea is, given some data.frame with sloppy state names (is the capital listed as "Washington, D.C.", "washington DC", "District of Columbia", etc.?), to have a single function return the same data.frame with the state name column removed and only the state ID codes remaining. Then subsequent merges can happen consistently.

I can do this in any number of ways, but one way that seems to be particularly elegant would be to house the merge map and the regular expression and the code process everything inside a closure (following the idea that a closure is a function with data).

Question 1: Is this a reasonable idea?

Question 2: If so, how do I do it in R?

Here's a stupid simple clean state names function that works on the example data:

cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia",x)] <- "DC"
  x
}

Here's some example data that the eventual function will be run on:

dat <- structure(list(state = c("Alabama", "Alaska", "Arizona", "Arkansas", 
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia", 
"Florida"), pop08 = structure(c(29L, 44L, 40L, 18L, 25L, 30L, 
22L, 48L, 36L, 13L), .Label = c("1,050,788", "1,288,198", "1,315,809", 
"1,316,456", "1,523,816", "1,783,432", "1,814,468", "1,984,356", 
"10,003,422", "11,485,910", "12,448,279", "12,901,563", "18,328,340", 
"19,490,297", "2,600,167", "2,736,424", "2,802,134", "2,855,390", 
"2,938,618", "24,326,974", "3,002,555", "3,501,252", "3,642,361", 
"3,790,060", "36,756,666", "4,269,245", "4,410,796", "4,479,800", 
"4,661,900", "4,939,456", "5,220,393", "5,627,967", "5,633,597", 
"5,911,605", "532,668", "591,833", "6,214,888", "6,376,792", 
"6,497,967", "6,500,180", "6,549,224", "621,270", "641,481", 
"686,293", "7,769,089", "8,682,661", "804,194", "873,092", "9,222,414", 
"9,685,744", "967,440"), class = "factor")), .Names = c("state", 
"pop08"), row.names = c(NA, 10L), class = "data.frame")

And a sample merge map (the actual one links FIPS codes to states, so it can't be trivially generated):

merge_map <- data.frame(state=dat$state, id=seq(10) )

EDIT Building off of crippledlambda's answer below, here's an attempt at the function:

prepForMerge <- local({
  merge_map <- structure(list(state = c("alabama", "alaska", "arizona", "arkansas",  "california", "colorado", "connecticut", "delaware", "DC", "florida" ), id = 1:10), .Names = c("state", "id"), row.names = c(NA, -10L ), class = "data.frame")
  list(
    replace_merge_map=function(new_merge_map) {
      merge_map <<- new_merge_map
    },
    show_merge_map=function() {
      merge_map
    },
    return_prepped_data.frame=function(dat) {
      dat$state <- cleanStateNames(dat$state)
      dat <- merge(dat,merge_map)
      dat <- subset(dat,select=c(-state))
      dat
    }
  )
})

> prepForMerge$return_prepped_data.frame(dat)
        pop08 id
1   4,661,900  1
2     686,293  2
3   6,500,180  3
4   2,855,390  4
5  36,756,666  5
6   4,939,456  6
7   3,501,252  7
8     591,833  9
9     873,092  8
10 18,328,340 10

Two problems remain before I'd consider this question solved:

  1. Calling prepForMerge$return_prepped_data.frame(dat) each time is painful. Is there any way to have a default function so that I could just call prepForMerge(dat)? I'm guessing not, given how it's implemented, but perhaps there's at least a convention for the default function.

  2. How do I avoid mixing the data and code in the merge_map definition? Ideally I'd clean merge_map elsewhere, then just grab it inside the closure and store that.

Answer

I may be missing the point of your question, but this is one way in which you can use a closure:

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   function(patt,newtext) {
+     statenames <- tolower(statenames)
+     statenames[grepl(patt,statenames)] <- newtext
+     statenames
+   }
+ })
> 
> replaceStateNames("columbia","DC")
 [1] "alabama"     "alaska"      "arizona"     "arkansas"    "california" 
 [6] "colorado"    "connecticut" "delaware"    "DC"          "florida"    
> replaceStateNames("alaska","palincountry")
 [1] "alabama"              "palincountry"         "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "florida"             
> replaceStateNames("florida","jebbushland")
 [1] "alabama"              "alaska"               "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "jebbushland"    
> 

But to generalize, you can replace statenames with your data frame definition, and return a function (or list of functions) which uses this data frame without having to pass it as an argument to the function call. Example (but note I've used the ignore.case=TRUE argument in grepl):

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   list(justreturn=function(patt,newtext) {
+     statenames[grepl(patt,statenames,ignore.case=TRUE)] <- newtext
+     statenames
+   },reassign=function(patt,newtext) {
+     statenames <<- replace(statenames,grepl(patt,statenames,ignore.case=TRUE),newtext)
+     statenames
+   })
+ })

Just like the first example:

> replaceStateNames$justreturn("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

Calling justreturn with a pattern that matches nothing simply returns the lexically scoped value of statenames, which confirms that the original values are unchanged:

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"              "Alaska"               "Arizona"             
 [4] "Arkansas"             "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"             "District of Columbia"
[10] "Florida"             

Do the same thing, but make the change "permanent":

> replaceStateNames$reassign("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

And note that the value of statenames attached to these functions has changed.

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    
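
To make the difference between the two functions visible, the closure's environment can be inspected directly. The following is only a sketch with a shortened statenames vector; the variable names env, before, after_justreturn, after_reassign and same_env are illustrative and not part of the original answer.

```r
replaceStateNames <- local({
  statenames <- c("Delaware", "District of Columbia", "Florida")
  list(
    justreturn = function(patt, newtext) {
      # Subassignment here creates a copy local to this call only
      statenames[grepl(patt, statenames, ignore.case = TRUE)] <- newtext
      statenames
    },
    reassign = function(patt, newtext) {
      # <<- rebinds statenames in the enclosing (closure) environment
      statenames <<- replace(statenames,
                             grepl(patt, statenames, ignore.case = TRUE),
                             newtext)
      statenames
    }
  )
})

# Both functions were created inside the same local() environment
env <- environment(replaceStateNames$justreturn)
same_env <- identical(env, environment(replaceStateNames$reassign))

before <- get("statenames", envir = env)
invisible(replaceStateNames$justreturn("columbia", "DC"))
after_justreturn <- get("statenames", envir = env)  # still the original values
invisible(replaceStateNames$reassign("columbia", "DC"))
after_reassign <- get("statenames", envir = env)    # now contains "DC"
```

Because both functions share one environment, a change made through reassign is seen by justreturn on its next call.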

In any case, you can replace statenames with a data frame, and these simple functions with a "merge map" or any other mapping you desire.

Edit

Speaking of "merge", is this what you're looking for? Here is an implementation of the first example from ?merge using a closure:

> authors <- data.frame(surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+                       nationality = c("US", "Australia", "US", "UK", "Australia"),
+                       deceased = c("yes", rep("no", 4)))
> books <- data.frame(name = I(c("Tukey", "Venables", "Tierney",
+                       "Ripley", "Ripley", "McNeil", "R Core")),
+                     title = c("Exploratory Data Analysis",
+                       "Modern Applied Statistics ...",
+                       "LISP-STAT",
+                       "Spatial Statistics", "Stochastic Simulation",
+                       "Interactive Data Analysis",
+                       "An Introduction to R"),
+                     other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                       "Venables & Smith"))
> 
> mergewithauthors <- with(list(authors=authors),function(books) 
+   merge(authors, books, by.x = "surname", by.y = "name"))
> 
> mergewithauthors(books)
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...       Ripley
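
Applied to the question's merge map, the same idea might look like the sketch below. It assumes a function factory, here given the hypothetical name makePrepForMerge, that takes an already cleaned merge map as an argument and returns a single function. That would allow calling prepForMerge(dat) directly and would keep the data out of the closure's definition. The small merge_map and dat objects are made-up stand-ins for the real data.

```r
cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia", x)] <- "DC"
  x
}

# Hypothetical factory: the merge map is passed in and captured by the closure
makePrepForMerge <- function(merge_map) {
  force(merge_map)  # evaluate the argument now rather than lazily
  function(dat) {
    dat$state <- cleanStateNames(dat$state)
    dat <- merge(dat, merge_map, by = "state")
    # Drop the state name column, keeping only the remaining columns
    dat[, setdiff(names(dat), "state"), drop = FALSE]
  }
}

# Illustrative data only; the real merge map would be cleaned elsewhere
merge_map <- data.frame(state = c("alabama", "DC", "florida"),
                        id = c(1L, 9L, 10L),
                        stringsAsFactors = FALSE)
prepForMerge <- makePrepForMerge(merge_map)

dat <- data.frame(state = c("Alabama", "District of Columbia", "Florida"),
                  pop08 = c("4,661,900", "591,833", "18,328,340"),
                  stringsAsFactors = FALSE)
prepped <- prepForMerge(dat)
```

The captured merge map lives in environment(prepForMerge), so it can still be examined there if needed. The row order of the result depends on how merge sorts the key column in the current locale.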

Edit 2

To read file into an object which will be lexically bound, you can either do

fn <- local({
  data <- read.csv("filename.csv")
  function(...) {
    ...
  }
})

or

fn <- with(list(data=read.csv("filename.csv")),
     function(...) {
       ...
     }
   )

or

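## note: local() returns the data frame itself here, so with() exposes
## its columns (rather than an object named data) to the function
fn <- with(local(data <- read.csv("filename.csv")),
     function(...) {
       ...
     }
   )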

and so on. (I assume the function(...) will have to do with your "merge_map".) You can also use evalq in place of local. To "bring in" objects residing in the global space (or enclosing environment), you can just do the following:

globalobj <- value      ## could be from read.csv()
fn <- local({
  localobj <- globalobj ## if globalobj is not locally defined, 
                        ## R will look in enclosing environment
                        ## in this case, the globalenv()
  function(...) {
    ...
  }
})

Then modifying globalobj later will not change the localobj attached to the function, since localobj was assigned when local() ran and almost(?) everything in R follows pass-by-value semantics. You can also use with instead of local, as shown in the examples above.
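
A small sketch of this behaviour, using made-up data in place of the value read from a file. The names fn_live and captured are illustrative: fn holds a local copy, while fn_live has no local copy and therefore looks up globalobj each time it is called.

```r
globalobj <- data.frame(state = c("alabama", "alaska"), id = 1:2)

fn <- local({
  localobj <- globalobj   # copied when local() is evaluated
  function() localobj
})

fn_live <- local({
  function() globalobj    # no local copy: found in globalenv() at call time
})

globalobj$id <- c(99L, 100L)   # modify the global object afterwards

captured <- fn()        # still carries the original ids
live     <- fn_live()   # reflects the modification
```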
