R supplying arguments while using case_when (R vectorization)

Question

This is a follow-up to a question I asked before (R apply multiple functions when large number of categories/types are present using case_when (R vectorization)). Unfortunately, I have not been able to figure out the problem. I think I may have narrowed down its source, and wanted to check whether someone with a better understanding could help me find a solution.

Suppose I have the following dataset:

set.seed(100)
City=c("City1","City2","City2","City1")
Business=c("B","A","A","B")
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)

Suppose there are two different businesses, named "A" and "B", and two different cities, City1 and City2. My original dataset contains about 200K observations with multiple businesses and about 100 cities. For each city, I have a unique pre-written function to compute adjusted revenue. Instead of running them on each observation/row, I want to use case_when to run the function for the relevant city (e.g. take the observations for City1, run a vectorized function for City1 if possible, then move on to City2, and so on).

For the purposes of illustration, suppose I have the following highly simplified functions for the two cities.

#Writing the custom functions for the categories here
City1=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
  return(BusinessMax)
}

City2=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)-1000*rnorm(1)
  return(BusinessMax)
}

These simple functions essentially subset the data for the city and add (City1) or subtract (City2) some random number from the city's maximum expected revenue. Once again, they are purely for illustration and do not reflect the actual functions. I also manually checked that the functions work by typing in:

City1(full_data = zz,observation = zz[1,])
City1(full_data = zz,observation = zz[4,]) 

and get 29.97808 and 36.31531. Note that, since the functions add or subtract a random number, I would expect to get different values for two observations in the same city, as I did here.

Finally, I try to use case_when to run the code as follows:

library(dplyr) #I use dplyr here
zz[,"AdjustedRevenue"] = case_when(
  zz[["City"]]=="City1"~City1(full_data=zz,observation=zz[,]),
  zz[["City"]]=="City2"~City2(full_data=zz,observation=zz[,])
)

The output I receive is as follows:

   City Business ExpectedRevenue AdjustedRevenue
1 City1        B              35        43.86785
2 City2        A              20       -81.97127
3 City2        A              15       -81.97127
4 City1        B              19        43.86785

Here, the adjusted values are the same for observations 1 and 4, and likewise for observations 2 and 3. What I expected instead was a different value for each observation (since I add or subtract some random number for each observation, or at least intended to). Following Martin Gal's answer to my previous question (https://stackoverflow.com/a/62378991/3988575), I suspect this is because the second argument of my City1 and City2 functions is not being supplied correctly in the final step. However, I have been somewhat lost trying to figure out why, and what to do to fix it.

It would be really helpful if someone could point out why this is happening and how to fix it. Thanks in advance!

P.S. I am also open to other vectorized solutions. I am relatively new to vectorization, do not have much experience with it, and would appreciate any suggestions.

Recommended answer

I converted the City functions to dplyr. If CityMaster is too simplified for the final function, then mer could be moved inside the case_when as applicable. If a new city is added to the data, it will return NA until a case is defined.

library(dplyr)
CityMaster <- function(data, city) {
  mer <- data %>%
    filter(City == city) %>%
    pull(ExpectedRevenue) %>%
    max()
  case_when(city == 'City1' ~ mer + 10 * rnorm(1),
            city == 'City2' ~ mer - 1000 * rnorm(1),
            TRUE ~ NA_real_)
}

set.seed(100)
zz %>%
  rowwise() %>%
  mutate(AdjustedRevenue = CityMaster(., City))

# A tibble: 4 x 4
# Rowwise: 
  City  Business ExpectedRevenue AdjustedRevenue
  <chr> <chr>              <dbl>           <dbl>
1 City1 B                     35            30.0
2 City2 A                     20          -867. 
3 City2 A                     15          -299. 
4 City1 B                     19            29.2
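
For context, here is a minimal sketch of why the original call in the question appears to repeat values while the rowwise version does not. It reuses the question's data and City1 function; the names once, recycled, city1_rows and per_row are introduced only for this illustration. The reading offered here is an interpretation of the posted output, not something the answer states: passing the whole data frame as observation makes the function run once and return a single number, which case_when then recycles across all matching rows.

```r
library(dplyr)

# Data and City1() as given in the question
zz <- data.frame(
  City = c("City1", "City2", "City2", "City1"),
  Business = c("B", "A", "A", "B"),
  ExpectedRevenue = c(35, 20, 15, 19),
  stringsAsFactors = FALSE
)

City1 <- function(full_data, observation) {
  NewSet <- full_data[which(full_data$City == observation$City), ]
  max(NewSet$ExpectedRevenue) + 10 * rnorm(1)
}

set.seed(100)

# With the whole data frame as `observation`, City is compared with itself,
# so every row is kept and the function returns one number, not one per row.
once <- City1(full_data = zz, observation = zz)
length(once)  # 1

# case_when() recycles that single number to every row matching the condition
# (rows that match no condition become NA).
recycled <- case_when(zz$City == "City1" ~ once)
recycled[1] == recycled[4]  # TRUE

# Calling the function once per City1 row draws a fresh random number each time,
# which is what rowwise() achieves in the answer.
city1_rows <- which(zz$City == "City1")
per_row <- sapply(city1_rows, function(i) {
  City1(full_data = zz, observation = zz[i, ])
})
```

Note also that in the non-rowwise call the maximum is taken over the full dataset rather than over one city, which is consistent with both cities' values in the question's output being offsets from 35.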

Breaking out the City functions

City1 <- function(data, city) {
  data %>%
    filter(City == city) %>%
    pull(ExpectedRevenue) %>%
    max() + 10 * rnorm(1)
}

City2 <- function(data, city) {
  data %>%
    filter(City == city) %>%
    pull(ExpectedRevenue) %>%
    max() - 1000 * rnorm(1)
}

set.seed(100)
zz %>%
  rowwise() %>%
  mutate(AdjustedRevenue = case_when(City == 'City1' ~ City1(., City),
                                     City == 'City2' ~ City2(., City),
                                     TRUE ~ NA_real_))
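
Since the question also asks about other vectorized approaches, the following is a rough sketch of a grouped alternative to rowwise. It assumes each city's real function can be rewritten to take all of that city's revenues at once and return one value per row; that may not hold for the actual functions. The names city1_vec, city2_vec, city_funs and adjust_group are hypothetical and do not come from the original answer.

```r
library(dplyr)

zz <- data.frame(
  City = c("City1", "City2", "City2", "City1"),
  Business = c("B", "A", "A", "B"),
  ExpectedRevenue = c(35, 20, 15, 19),
  stringsAsFactors = FALSE
)

# Hypothetical vectorized versions of the city functions: each receives the
# revenues of one city and returns one adjusted value per observation.
city1_vec <- function(revenue) max(revenue) + 10 * rnorm(length(revenue))
city2_vec <- function(revenue) max(revenue) - 1000 * rnorm(length(revenue))

# Hypothetical lookup table from city name to its function.
city_funs <- list(City1 = city1_vec, City2 = city2_vec)

# Dispatch once per group; cities without a function get NA for every row.
adjust_group <- function(revenue, city) {
  f <- city_funs[[city]]
  if (is.null(f)) {
    return(rep(NA_real_, length(revenue)))
  }
  f(revenue)
}

set.seed(100)
result <- zz %>%
  group_by(City) %>%
  mutate(AdjustedRevenue = adjust_group(ExpectedRevenue, as.character(City[1]))) %>%
  ungroup()

# A city with no registered function falls back to NA.
unknown <- adjust_group(c(1, 2), "City3")
```

With roughly 100 cities, this calls each city's function once per group instead of once per row, and adding a city only requires a new entry in the lookup list. The random draws are consumed in a different order than in the rowwise version, so the numbers will not match the output shown above even with the same seed.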
