用sparklyr中的dplyr计算每列中的唯一元素的数量 [英] count number of unique elements in each columns with dplyr in sparklyr

查看:102
本文介绍了用sparklyr中的dplyr计算每列中的唯一元素的数量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试计算spark数据集s中每一列的唯一元素的数量.

I'm trying to count the number of unique elements in each column in the spark dataset s.

但是,似乎火花无法识别tally() k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(tally(distinct(.))))) Error: org.apache.spark.sql.AnalysisException: undefined function TALLY

However It seems that spark doesn't recognize tally() k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(tally(distinct(.))))) Error: org.apache.spark.sql.AnalysisException: undefined function TALLY

似乎spark也无法识别简单的r函数,例如"unique"或"length".我可以在本地数据上运行代码,但是当我尝试在spark表上运行完全相同的代码时,它将无法正常工作.

It seems that spark doesn't recognize simple r functions either, like "unique" or "length". I can run the code on local data, but when I try to run the exact same code on spark table it doesn't work.

```

d<-data.frame(cbind(seq(1,10,1),rep(1,10)))
d$group<-rep(c("a","b"),each=5)
d%>%group_by(group)%>%summarise_each(funs(length(unique(.))))
A tibble: 2 × 3
  group    X1    X2
  <chr> <int> <int>
1     a     5     1
2     b     5     1
k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(length(unique(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;

```

推荐答案

记住,在编写sparlyr时,您实际上是在转换为spark-sql,因此您可能需要不时使用spark-sql动词.这是诸如countdistinct这样的spark-sql动词派上用场的时候之一.

Remember when you are writing sparlyr you are really transpiling to spark-sql, so you may need to use spark-sql verbs from time to time. This is one of those times where spark-sql verbs like count and distinct come in handy.

library(sparkylr)

sc <- spark_connect()
iris_spk <- copy_to(sc, iris)

# for instance this does not work in R, but it does in sparklyr
iris_spk %>%
  summarise(Species = distinct(Species))
# or
iris_spk %>%
  summarise(Species = approx_count_distinct(Species))

# this does what you are looking for
iris_spk %>% 
    group_by(species) %>%
    summarise_all(funs(n_distinct))

# for larger data sets this is much faster
iris_spk %>% 
    group_by(species) %>%
    summarise_all(funs(approx_count_distinct))

这篇关于用sparklyr中的dplyr计算每列中的唯一元素的数量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆