Why can't I use double colon operator with dplyr when the dataset is in sparklyr?


Problem description


A reproducible example (adapted from @forestfanjoe's answer):

library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")

df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))

df <- copy_to(sc, df, "payment")

> head(df)
# Source: spark<?> [?? x 2]
     id PaymentHistory
* <int>          <dbl>
1     1         -0.138
2     2         -0.249
3     3         -0.805
4     4          1.30 
5     5          1.54 
6     6          0.936

fix_PaymentHistory <- function(df){
  df %>% dplyr::mutate(
    PaymentHistory = dplyr::if_else(
      PaymentHistory < 0, 0,
      dplyr::if_else(PaymentHistory > 1, 1, PaymentHistory)
    )
  )
}

df %>% fix_PaymentHistory

The error is:

 Error in dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory >  : 
 object 'PaymentHistory' not found 


I'm using the scope operator because I'm afraid that the name in dplyr will clash with some of the user-defined code. Note that PaymentHistory is a column variable in df.

The error does not occur when running the following code:

fix_PaymentHistory <- function(df){
    df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
}
> df %>% fix_PaymentHistory
# Source: spark<?> [?? x 2]
      id PaymentHistory
 * <int>          <dbl>
 1     1         0     
 2     2         0     
 3     3         0     
 4     4         1     
 5     5         1     
 6     6         0.936 
 7     7         0     
 8     8         0.716 
 9     9         0     
10    10         0.0831
# ... with more rows


Answer


TL;DR Because your code doesn't use dplyr::if_else at all.

sparklyr, when used as in the example, treats Spark as just another database and issues queries through the dbplyr SQL translation layer.

In this context, if_else is not treated as a function but as an identifier, which is translated into SQL primitives:

dbplyr::translate_sql(if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
# <SQL> CASE WHEN ("PaymentHistory" < 0.0) THEN (0.0) WHEN NOT("PaymentHistory" < 0.0) THEN (CASE WHEN ("PaymentHistory" > 1.0) THEN (1.0) WHEN NOT("PaymentHistory" > 1.0) THEN ("PaymentHistory") END) END
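
The same translation can be inspected on the lazy Spark table itself. A minimal check, assuming the working (non-namespaced) fix_PaymentHistory from the question is in scope, is to look at the query that gets generated:

df %>%
  fix_PaymentHistory() %>%
  dplyr::show_query()
# Prints the generated SELECT statement, with PaymentHistory wrapped in
# the same nested CASE WHEN expression as above.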

However, if you pass a fully qualified name, you bypass this mechanism: dplyr tries to evaluate the function locally, which ultimately fails because the database columns are not in scope.



I'm afraid that the name in dplyr will clash with some of the user-defined code.

As you can see, dplyr does not need to be in scope here at all: functions called in sparklyr pipelines are either translated to the corresponding SQL constructs or, if there is no specific translation rule, passed through as-is and resolved by the Spark SQL engine (this path is used to invoke Spark functions).
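
To address the name-clash concern directly: only the functions inside the translated expression need to stay bare; the verbs themselves (mutate, filter, ...) are ordinary R functions and can be namespaced safely. A small sketch along those lines, reusing the payment table from the question (the helper fix_PaymentHistory2 and the column name id_hash are made up for illustration; hash() is assumed to resolve to Spark SQL's built-in hash function, for which dplyr has no translation rule):

# The verb is qualified (it is evaluated in R and dispatches on the
# Spark tbl), while the bare if_else inside is translated to CASE WHEN.
fix_PaymentHistory2 <- function(df) {
  df %>% dplyr::mutate(
    PaymentHistory = if_else(
      PaymentHistory < 0, 0,
      if_else(PaymentHistory > 1, 1, PaymentHistory)
    )
  )
}

df %>% fix_PaymentHistory2()

# A function with no translation rule is passed through verbatim and
# resolved by Spark SQL (here, its built-in hash()).
df %>% dplyr::mutate(id_hash = hash(id))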

Of course, this mechanism is not specific to sparklyr, and you are likely to see the same behavior with any other database-backed table:

library(magrittr)

db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)

db %>% dplyr::tbl("mtcars") %>% dplyr::mutate(dplyr::if_else(mpg < 20, 1, 0))



Error in dplyr::if_else(mpg < 20, 1, 0) : object 'mpg' not found
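
Conversely, dropping the qualifier lets dbplyr translate the call instead of trying to evaluate it locally. A minimal sketch continuing the same SQLite example (the column name low_mpg is made up for illustration):

# With a bare if_else, dbplyr translates the call into a CASE WHEN
# expression, and the query runs against the SQLite table.
db %>%
  dplyr::tbl("mtcars") %>%
  dplyr::mutate(low_mpg = if_else(mpg < 20, 1, 0))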
