Sparklyr: Use group_by and then concatenate strings from rows in a group


Problem description

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.

Here is a simple example that I think should work but doesn't:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                x  = c("200", "200", "200", "201", "201", "201"),
                y  = c("This", "That", "The", "Other", "End", "End"))
d_sdf <- copy_to(sc, d, "d")
d_sdf %>% group_by(id, x) %>% mutate(y = paste(y, collapse = " "))

What I would like this to produce is:

Source: local data frame [6 x 3]
Groups: id, x [4]

# A tibble: 6 x 3
      id      x         y
  <fctr> <fctr>     <chr>
1      1    200 This That
2      1    200 This That
3      2    200       The
4      2    201 Other End
5      1    201       End
6      2    201 Other End

Instead, I get the following error:

Error: org.apache.spark.sql.AnalysisException: missing ) at 'AS' near '' '' in selection target; line 1 pos 42

Note that using the same code on a data.frame works fine:

d %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))


Answer

Spark SQL doesn't like it if you use aggregate functions without aggregating, which is why this works in dplyr on an ordinary data frame but not on a Spark DataFrame: sparklyr translates your commands into a SQL statement. You can observe this going wrong if you look at the second part of the error message:

== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`

paste gets translated to CONCAT_WS; concat, however, would paste columns together.
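
To see the difference, here is a minimal sketch (reusing sc and d_sdf from above; concat is not an R function here but Spark SQL's CONCAT, which sparklyr passes through untranslated):

# concat glues columns together row by row, with no separator --
# not what we want, since we need to combine rows within a group.
d_sdf %>% mutate(xy = concat(x, y))
# xy: "200This", "200That", "200The", "201Other", "201End", "201End"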

A better equivalent would be collect_list or collect_set, but they produce list outputs.
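
For example (a sketch using the same d_sdf as above), collect_list on its own returns an array column per group rather than a single string:

d_sdf %>%
  group_by(id, x) %>%
  summarise(ys = collect_list(y))
# ys is an array<string> column, e.g. ["This", "That"] for id 1, x 200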

But you can build on that:

If you do not want to have the same row replicated in your result, you can use summarise, collect_list, and paste:

res <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = paste(collect_list(y)))

Result:

Source:     lazy query [?? x 3]
Database:   spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id

     id     x   yconcat
  <chr> <chr>     <chr>
1     1   201       End
2     2   201 Other End
3     1   200 This That
4     2   200       The
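
If you want to check the SQL that sparklyr generates for this, show_query() from dplyr prints the translated statement (a sketch; the exact SQL text may differ slightly by version):

res %>% show_query()
# roughly: SELECT id, x, CONCAT_WS(' ', COLLECT_LIST(y)) AS yconcat
#          FROM d GROUP BY id, x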

You can join this back onto your original data if you do want to have your rows replicated:

d_sdf %>% left_join(res)

Result:

Source:     lazy query [?? x 4]
Database:   spark connection master=local[8] app=sparklyr local=TRUE

     id     x     y   yconcat
  <chr> <chr> <chr>     <chr>
1     1   200  This This That
2     1   200  That This That
3     2   200   The       The
4     2   201 Other Other End
5     1   201   End       End
6     2   201   End Other End
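
One side note: left_join above relies on dplyr's natural join, which matches on all shared column names (here id and x). Spelling the keys out makes that explicit:

# Explicit join keys avoid surprises if the tables ever gain
# additional columns with matching names.
d_sdf %>% left_join(res, by = c("id", "x"))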
