Number of unique values in sparklyr
Question
The following example shows that, using dplyr with sparklyr, you cannot calculate the number of distinct values without aggregating the rows.
Is there a workaround that doesn't break the chain of commands?
More generally, how can you use SQL-like window functions on sparklyr data frames?
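library(dplyr)
library(sparklyr)
## assumes a Spark connection `sc` already exists,
## e.g. sc <- spark_connect(master = "local")
## generating a data set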
set.seed(.328)
df <- data.frame(
ids = floor(runif(10, 1, 10)),
cats = sample(letters[1:3], 10, replace = TRUE),
vals = rnorm(10)
)
## copying to Spark
df.spark <- copy_to(sc, df, "df_spark", overwrite = TRUE)
# Source: table<df_spark> [?? x 3]
# Database: spark_connection
# ids cats vals
# <dbl> <chr> <dbl>
# 9 a 0.7635935
# 3 a -0.7990092
# 4 a -1.1476570
# 6 c -0.2894616
# 9 b -0.2992151
# 2 c -0.4115108
# 9 b 0.2522234
# 9 c -0.8919211
# 6 c 0.4356833
# 6 b -1.2375384
# # ... with more rows
# using the regular dataframe
df %>% mutate(n_ids = n_distinct(ids))
# ids cats vals n_ids
# 9 a 0.7635935 5
# 3 a -0.7990092 5
# 4 a -1.1476570 5
# 6 c -0.2894616 5
# 9 b -0.2992151 5
# 2 c -0.4115108 5
# 9 b 0.2522234 5
# 9 c -0.8919211 5
# 6 c 0.4356833 5
# 6 b -1.2375384 5
# using the sparklyr data frame
df.spark %>% mutate(n_ids = n_distinct(ids))
Error: Window function `distinct()` is not supported by this database
Recommended answer
The best approach is to compose count and distinct (count ∘ distinct):
n_ids <- df.spark %>%
select(ids) %>% distinct() %>% count() %>% collect() %>%
unlist %>% as.vector
df.spark %>% mutate(n_ids = n_ids)
Or use approx_count_distinct:
n_ids_approx <- df.spark %>%
select(ids) %>% summarise(approx_count_distinct(ids)) %>% collect() %>%
unlist %>% as.vector
df.spark %>% mutate(n_ids = n_ids_approx)
This is a bit verbose, but the window function approach used by dplyr is a dead end anyway if you want a global unbounded frame.
If you want exact results, you can also compute the count on the Spark side (this returns a one-row table holding the count):
df.spark %>%
spark_dataframe() %>%
invoke("selectExpr", list("COUNT(DISTINCT ids) as cnt_unique_ids")) %>%
sdf_register()
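Another way to keep the chain unbroken without calling collect() is to compute the distinct count as a one-row summary and join it back onto every row through a constant key. The following is only a minimal sketch: the helper name add_n_distinct_ids and the join_key column are hypothetical, and the local data frame reuses the ids shown in the question. The same dplyr verbs are expected to translate to Spark SQL when the function is given df.spark, but that has not been verified here.

```r
library(dplyr)

# Hypothetical helper: attach the number of distinct `ids` to every row
# by joining a one-row summary back onto the table via a constant key.
add_n_distinct_ids <- function(tbl) {
  counts <- tbl %>%
    summarise(n_ids = n_distinct(ids)) %>%  # aggregate, not a window function
    mutate(join_key = 1L)

  tbl %>%
    mutate(join_key = 1L) %>%
    inner_join(counts, by = "join_key") %>%
    select(-join_key)
}

# Local check using the ids from the question's example output
df_local <- data.frame(
  ids  = c(9, 3, 4, 6, 9, 2, 9, 9, 6, 6),
  cats = c("a", "a", "a", "c", "b", "c", "b", "c", "c", "b")
)
result <- add_n_distinct_ids(df_local)
print(result)

# With Spark (assumption, untested here): add_n_distinct_ids(df.spark)
```

Because the count comes from an ordinary aggregate rather than a window function, this avoids the unsupported distinct-over-window case, at the cost of one extra join.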