Sparklyr: Use group_by and then concatenate strings from rows in a group
Question
I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.
Here is a simple example that I think should work but doesn't:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
d <- data.frame(id=c("1", "1", "2", "2", "1", "2"),
x=c("200", "200", "200", "201", "201", "201"),
y=c("This", "That", "The", "Other", "End", "End"))
d_sdf <- copy_to(sc, d, "d")
d_sdf %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))
What I'd like this to produce is:
Source: local data frame [6 x 3]
Groups: id, x [4]
# A tibble: 6 x 3
id x y
<fctr> <fctr> <chr>
1 1 200 This That
2 1 200 This That
3 2 200 The
4 2 201 Other End
5 1 201 End
6 2 201 Other End
Instead, I get the following error:
Error: org.apache.spark.sql.AnalysisException: missing ) at 'AS' near '' '' in selection target; line 1 pos 42
Note that using the same code on a data.frame works fine:
d %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))
Answer
Spark SQL doesn't like it if you use aggregate functions without aggregating, which is why this works in dplyr with an ordinary dataframe but not with a SparkDataFrame: sparklyr translates your commands into an SQL statement. You can observe this going wrong if you look at the second bit of the error message:
== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`
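By contrast, the SQL that Spark will accept has to aggregate before it concatenates. A hand-written sketch of a query that does what the question intends (assuming the table was registered as `d` by the copy_to call above):

```sql
-- Aggregate first, then concatenate: COLLECT_LIST gathers each group's
-- values into an array, and CONCAT_WS joins that array with a separator.
SELECT `id`, `x`, CONCAT_WS(' ', COLLECT_LIST(`y`)) AS `y`
FROM `d`
GROUP BY `id`, `x`
```

This is the same shape the sparklyr/dplyr pipeline below is translated into.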
paste gets translated to CONCAT_WS. concat, however, would paste columns together.
A better equivalent would be collect_list or collect_set, but they produce list outputs.
But you can build on that:
If you do not want the same row replicated in your result, you can use summarise, collect_list, and paste:
res <- d_sdf %>%
group_by(id, x) %>%
summarise(yconcat = paste(collect_list(y)))
Result:
Source: lazy query [?? x 3]
Database: spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id
id x yconcat
<chr> <chr> <chr>
1 1 201 End
2 2 201 Other End
3 1 200 This That
4 2 200 The
You can join this back onto your original data if you do want your rows replicated:
d_sdf %>% left_join(res)
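One caveat: without an explicit by argument, left_join guesses the join columns from the shared names and prints a message. A sketch with the keys spelled out (using the same res as above):

```r
# Join on the grouping columns explicitly rather than relying on
# dplyr's automatic detection of common column names.
d_sdf %>% left_join(res, by = c("id", "x"))
```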
Result:
Source: lazy query [?? x 4]
Database: spark connection master=local[8] app=sparklyr local=TRUE
id x y yconcat
<chr> <chr> <chr> <chr>
1 1 200 This This That
2 1 200 That This That
3 2 200 The The
4 2 201 Other Other End
5 1 201 End End
6 2 201 End Other End