Counting rows of a DataFrame with a condition in Spark
Question
I am trying the following:
df=dfFromJson:
{"class":"name 1","stream":"science"}
{"class":"name 1","stream":"arts"}
{"class":"name 1","stream":"science"}
{"class":"name 1","stream":"law"}
{"class":"name 1","stream":"law"}
{"class":"name 2","stream":"science"}
{"class":"name 2","stream":"arts"}
{"class":"name 2","stream":"law"}
{"class":"name 2","stream":"science"}
{"class":"name 2","stream":"arts"}
{"class":"name 2","stream":"law"}
df.groupBy("class").agg(
  count(col("stream") === "science") as "stream_science",
  count(col("stream") === "arts") as "stream_arts",
  count(col("stream") === "law") as "stream_law"
)
This is not giving the expected output; how can I achieve it in the fastest way?
Answer
It is not exactly clear what the expected output is, but I guess you want something like this:
import org.apache.spark.sql.functions.{count, col, when}

// Collect the distinct stream values so we can build one conditional count per stream
val streams = df.select($"stream").distinct.collect.map(_.getString(0))

// count(when(cond, 1)) counts only the rows where cond holds; counting a plain
// Boolean column (as in the question) counts every non-null row instead
val exprs = streams.map(s => count(when($"stream" === s, 1)).alias(s"stream_$s"))

df
  .groupBy("class")
  .agg(exprs.head, exprs.tail: _*)
// +------+--------------+----------+-----------+
// | class|stream_science|stream_law|stream_arts|
// +------+--------------+----------+-----------+
// |name 1| 2| 2| 1|
// |name 2| 2| 2| 2|
// +------+--------------+----------+-----------+
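As a side note, on Spark 1.6+ a similar result can be obtained with `RelationalGroupedDataset.pivot`. This is only a sketch against the `df` above; note that the columns come out named after the raw stream values rather than the `stream_*` aliases, and cells for missing (class, stream) combinations are null instead of 0:

```scala
// Pivot on the values of "stream"; each cell is the row count for that
// (class, stream) pair. Passing the value list explicitly avoids an
// extra job to discover the distinct values first.
df.groupBy("class")
  .pivot("stream", Seq("science", "arts", "law"))
  .count()
```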
If you don't care about the names and have only one group column, you can simply use DataFrameStatFunctions.crosstab:
df.stat.crosstab("class", "stream")
// +------------+---+----+-------+
// |class_stream|law|arts|science|
// +------------+---+----+-------+
// | name 1| 2| 1| 2|
// | name 2| 2| 2| 2|
// +------------+---+----+-------+