Count the number of unique elements in each column with dplyr in sparklyr
Question
I'm trying to count the number of unique elements in each column of the Spark dataset s.
However, it seems that Spark doesn't recognize tally():
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY
Spark doesn't seem to recognize simple R functions such as unique() or length() either. The code runs on local data, but the exact same code fails when I run it on a Spark table.
```
library(dplyr)

# Local data frame: this works
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d %>% group_by(group) %>% summarise_each(funs(length(unique(.))))
# A tibble: 2 × 3
#   group    X1    X2
#   <chr> <int> <int>
# 1 a         5     1
# 2 b         5     1

# Spark table: the same pipeline fails
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(length(unique(.)))))
# Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```
Recommended answer
Remember that when you write sparklyr code you are really transpiling to Spark SQL, so you may need to use Spark SQL verbs from time to time. This is one of those times where Spark SQL verbs like count and distinct come in handy.
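To make the transpiling step concrete, here is a minimal sketch that asks dbplyr for the SQL it would generate, without needing a Spark cluster. It assumes a recent dbplyr where translate_sql() takes a con argument, and it uses the simulated connection simulate_dbi() as a stand-in for Spark, so the exact quoting may differ from what sparklyr emits.

```r
library(dbplyr)

# A simulated connection stands in for Spark here (an assumption for
# illustration; a real sparklyr connection has its own translations).
con <- simulate_dbi()

# n_distinct() has a known SQL translation, so it becomes a
# COUNT(DISTINCT ...) expression.
known <- translate_sql(n_distinct(x), con = con, window = FALSE)

# tally() has no translation, so it is passed through as a plain
# function call that the database itself must understand.
unknown <- translate_sql(tally(x), con = con, window = FALSE)

print(known)
print(unknown)
```

An untranslated call like the second one is what ends up in front of Spark, which is consistent with the "undefined function TALLY" error in the question.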
library(sparklyr)
library(dplyr)
# a local master is assumed here; point this at your own cluster
sc <- spark_connect(master = "local")
iris_spk <- copy_to(sc, iris)
# for instance this does not work in R, but it does in sparklyr
iris_spk %>%
  summarise(Species = distinct(Species))
# or
iris_spk %>%
  summarise(Species = approx_count_distinct(Species))
# this does what you are looking for
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(n_distinct))
# for larger data sets this is much faster
iris_spk %>%
  group_by(Species) %>%
  summarise_all(funs(approx_count_distinct))
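Note that funs() has been deprecated since dplyr 0.8.0 and summarise_all() is superseded by across() since dplyr 1.0.0. A minimal sketch of the same per-group distinct count in the newer style, run on a local data frame shaped like the one in the question (the column names X1 and X2 are set explicitly here). The same pipeline may also translate on a Spark table with a recent dbplyr, but that is an assumption and is not verified here.

```r
library(dplyr)

# Local stand-in for the question's data: 10 rows, two groups of 5.
d <- data.frame(X1 = seq(1, 10, 1), X2 = rep(1, 10))
d$group <- rep(c("a", "b"), each = 5)

# across(everything(), ...) applies n_distinct to every non-grouping column.
res <- d %>%
  group_by(group) %>%
  summarise(across(everything(), n_distinct))

print(res)
```

Each group has five distinct X1 values and a single distinct X2 value, matching the tibble shown in the question.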