Sparklyr: Use group_by and then concatenate strings from rows in a group


Problem description

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.

Here is a simple example that I think should work but doesn't:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                x  = c("200", "200", "200", "201", "201", "201"),
                y  = c("This", "That", "The", "Other", "End", "End"))
d_sdf <- copy_to(sc, d, "d")
d_sdf %>% group_by(id, x) %>% mutate(y = paste(y, collapse = " "))

What I would like this to produce is:

Source: local data frame [6 x 3]
Groups: id, x [4]

# A tibble: 6 x 3
      id      x         y
  <fctr> <fctr>     <chr>
1      1    200 This That
2      1    200 This That
3      2    200       The
4      2    201 Other End
5      1    201       End
6      2    201 Other End

Instead, I get the following error:

Error: org.apache.spark.sql.AnalysisException: missing ) at 'AS' near '' '' in selection target; line 1 pos 42

Note that using the same code on a data.frame works fine:

d %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))


Answer

Spark SQL doesn't like it if you use aggregate functions without aggregating, which is why this works in dplyr on an ordinary data frame but not on a Spark DataFrame: sparklyr translates your commands into a SQL statement. You can observe this going wrong if you look at the second part of the error message:

== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`

paste gets translated to CONCAT_WS; concat, however, would paste columns together.
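
To see the difference, here is a minimal sketch (reusing sc and d_sdf from above; concat is not an R function here but Spark SQL's CONCAT, which sparklyr passes through untranslated):

# concat glues columns together row by row, with no separator --
# not what we want, since we need to combine rows within a group.
d_sdf %>% mutate(xy = concat(x, y))
# xy: "200This", "200That", "200The", "201Other", "201End", "201End"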

A better equivalent would be collect_list or collect_set, but they produce list outputs.
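
For example (a sketch using the same d_sdf as above), collect_list on its own returns an array column per group rather than a single string:

d_sdf %>%
  group_by(id, x) %>%
  summarise(ys = collect_list(y))
# ys is an array<string> column, e.g. ["This", "That"] for id 1, x 200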

But you can build on that:

If you do not want to have the same row replicated in your result, you can use summarise, collect_list, and paste:

res <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = paste(collect_list(y)))

Result:

Source:     lazy query [?? x 3]
Database:   spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id

     id     x   yconcat
  <chr> <chr>     <chr>
1     1   201       End
2     2   201 Other End
3     1   200 This That
4     2   200       The
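
If you want to check the SQL that sparklyr generates for this, show_query() from dplyr prints the translated statement (a sketch; the exact SQL text may differ slightly by version):

res %>% show_query()
# roughly: SELECT id, x, CONCAT_WS(' ', COLLECT_LIST(y)) AS yconcat
#          FROM d GROUP BY id, x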

You can join this back onto your original data if you do want to have your rows replicated:

d_sdf %>% left_join(res)

Result:

Source:     lazy query [?? x 4]
Database:   spark connection master=local[8] app=sparklyr local=TRUE

     id     x     y   yconcat
  <chr> <chr> <chr>     <chr>
1     1   200  This This That
2     1   200  That This That
3     2   200   The       The
4     2   201 Other Other End
5     1   201   End       End
6     2   201   End Other End
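
One side note: left_join above relies on dplyr's natural join, which matches on all shared column names (here id and x). Spelling the keys out makes that explicit:

# Explicit join keys avoid surprises if the tables ever gain
# additional columns with matching names.
d_sdf %>% left_join(res, by = c("id", "x"))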
