如果存在多个条件,则SparkSQL总和 [英] SparkSQL sum if on multiple conditions

查看:86
本文介绍了如果存在多个条件,则SparkSQL总和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个像这样的SparkSQL DataFrame:

I have a SparkSQL DataFrame like this:

name gender age isActive points
-------------------------------
 Bob      M  12     true    100
 Hal      M  16    false     80
 Pat      F  21     true     70
 Lin      F  17    false     40
 Zac      M  18     true     20
 Mei      F  19     true     10
 Sal      M  13    false     10

我有一些类似的功能:

def isEligible(prog: String) (name: String, gender: String, age: Int, isActive: Boolean, points: Int): Boolean

确定某人是否有资格参加特定计划.对于实例,以下函数调用将告诉我Bob是否符合Program1的条件:

that determines whether someone is eligible for a particular program. For Instance, the following function call would tell me whether Bob is eligible for Program1:

isEligible("Program1", "Bob", "M", 12, true, 100)

一个人可能有资格参加多个课程.我想编写一个使用此DataFrame并输出摘要DataFrame的函数,如下所示:

A person may be eligible for more than one program. I want to write a function that takes this DataFrame, and outputs a summary DataFrame like so:

prog1 prog2 prog3 prog4
-----------------------
    7     3     2     5

显示符合每个计划条件的人数.在Spark中执行此操作的最佳方法是什么?我知道我可以使用 struct agg 函数,但是我不知道如何将我的 isEligible 函数合并到SparkSQL查询中.

which shows the number of people who are eligible for each program. What is the best way to do this in Spark? I know I can use struct and agg functions, but I don't know how to incorporate my isEligible function into the SparkSQL query.

推荐答案

定义程序列表:

val progs = Seq("prog1", "prog2", "prog3", "prog4")

定义表达式

@transient val exprs = progs.map(p => {
  val f = udf(isEligible(p) _)
  sum(f(
    $"name", $"gender", $"age", $"isActive", $"points"
  ).cast("long")).alias(p)
})

df.select(exprs: _*)

您还可以使用强类型数据集:

You could also use strongly typed dataset:

import org.apache.spark.sql.Row

case class Record(name: String, gender: String, age: Int, 
                  isActive: Boolean, points: Int)

df.as[Record].flatMap {
   case Record(name, gender, age, isActive, points) => 
     progs.filter(p => isEligible(p)(name, gender, age, isActive, points))
}.groupBy().pivot("value", progs).count()

这篇关于如果存在多个条件,则SparkSQL总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆