Why can't I use the double colon operator with dplyr when the dataset is in sparklyr?
Question
A reproducible example (adapted from @forestfanjoe's answer):
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))
df <- copy_to(sc, df, "payment")
> head(df)
# Source: spark<?> [?? x 2]
     id PaymentHistory
* <int>          <dbl>
1     1         -0.138
2     2         -0.249
3     3         -0.805
4     4          1.30
5     5          1.54
6     6          0.936
fix_PaymentHistory <- function(df){
  df %>% dplyr::mutate(PaymentHistory = dplyr::if_else(PaymentHistory < 0, 0,
                                                       dplyr::if_else(PaymentHistory > 1, 1, PaymentHistory)))
}
df %>% fix_PaymentHistory
The error is:
Error in dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory > :
object 'PaymentHistory' not found
I'm using the scope operator because I'm afraid that the names in dplyr will clash with some user-defined code. Note that PaymentHistory is a column variable in df.
The same error is not present when running the following code:
fix_PaymentHistory <- function(df){
  df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0, if_else(PaymentHistory > 1, 1, PaymentHistory)))
}
> df %>% fix_PaymentHistory
# Source: spark<?> [?? x 2]
      id PaymentHistory
 * <int>          <dbl>
 1     1         0
 2     2         0
 3     3         0
 4     4         1
 5     5         1
 6     6         0.936
 7     7         0
 8     8         0.716
 9     9         0
10    10         0.0831
# ... with more rows
Answer
TL;DR Because your code doesn't use dplyr::if_else at all.
sparklyr, when used as in the example, treats Spark as yet another database and issues queries through the dbplyr SQL translation layer.
In this context, if_else is not treated as a function but as an identifier, which is converted to SQL primitives:
dbplyr::translate_sql(if_else(PaymentHistory < 0, 0, if_else(PaymentHistory > 1, 1, PaymentHistory)))
# <SQL> CASE WHEN ("PaymentHistory" < 0.0) THEN (0.0) WHEN NOT("PaymentHistory" < 0.0) THEN (CASE WHEN ("PaymentHistory" > 1.0) THEN (1.0) WHEN NOT("PaymentHistory" > 1.0) THEN ("PaymentHistory") END) END
However, if you pass a fully qualified name, it circumvents this mechanism: dbplyr tries to evaluate the function as ordinary R code, which ultimately fails because the database columns are not in scope.
I'm afraid that the name in dplyr will clash with some of the user-defined code.
As you see, there is no need for dplyr to be in scope here at all - functions called in sparklyr pipelines are either translated to corresponding SQL constructs, or, if there is no specific translation rule in place, passed as-is and resolved by the Spark SQL engine (this path is used to invoke Spark functions).
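If you still want the protection of fully qualified names, one workaround (not from the original answer, just a minimal sketch assuming plain dplyr) is to apply such a function only to local data frames, e.g. after collect()-ing from Spark, where dplyr::if_else() is evaluated as an ordinary R function and the columns are in scope:

```r
library(dplyr)

# On a local data frame, dplyr::if_else() is a regular R function call,
# so the fully qualified form works: it clamps PaymentHistory to [0, 1].
# With a sparklyr tbl you would call collect() first to localize the data.
fix_PaymentHistory_local <- function(df) {
  df %>%
    dplyr::mutate(PaymentHistory = dplyr::if_else(
      PaymentHistory < 0, 0,
      dplyr::if_else(PaymentHistory > 1, 1, PaymentHistory)
    ))
}

local_df <- data.frame(id = 1:4, PaymentHistory = c(-0.5, 0.3, 1.5, 0.9))
fix_PaymentHistory_local(local_df)$PaymentHistory
# values clamped to [0, 1]
```

The trade-off is that collect() pulls the data out of Spark, so this only makes sense for results small enough to fit in local memory.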
Of course, this mechanism is not specific to sparklyr, and you're likely to see the same behavior with other tables backed by a database:
library(magrittr)
db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)
db %>% dplyr::tbl("mtcars") %>% dplyr::mutate(dplyr::if_else(mpg < 20, 1, 0))
Error in dplyr::if_else(mpg < 20, 1, 0) : object 'mpg' not found
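For contrast, a minimal sketch (assuming the RSQLite package is installed; it uses the modern DBI connection style rather than the deprecated src_sqlite) showing that the unqualified call succeeds, because dbplyr translates if_else() into a SQL CASE WHEN expression:

```r
library(dplyr)

# The unqualified if_else() is translated by dbplyr to SQL, so the
# database never needs an R function named if_else() at all.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

tbl(con, "mtcars") %>%
  mutate(low_mpg = if_else(mpg < 20, 1, 0)) %>%
  head()

DBI::dbDisconnect(con)
```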