How to find first non-null values in groups? (secondary sorting using dataset api)


Question

I am working on a dataset that represents a stream of events (e.g. fired as tracking events from a website). All the events have a timestamp. One use case we often have is trying to find the first non-null value for a given field. So, for example, something like this gets us most of the way there:

val eventsDf = spark.read.json(jsonEventsPath) 

case class ProjectedFields(visitId: String, userId: Int, timestamp: Long ... )

val projectedEventsDs = eventsDf.select(
    eventsDf("message.visit.id").alias("visitId"),
    eventsDf("message.property.user_id").alias("userId"),
    eventsDf("message.property.timestamp"),

    ...

).as[ProjectedFields]

projectedEventsDs.groupBy($"visitId").agg(first($"userId", true)) // true = ignore nulls

The problem with the above code is that the order of the data being fed into that first aggregation function is not guaranteed. I would like it to be sorted by timestamp, to ensure that the result is the first non-null userId by timestamp rather than any random non-null userId.

Is there a way to define the sorting within a grouping?

Using Spark 2.10

BTW, the way suggested for Spark 2.10 in "SPARK DataFrame: select the first row of each group" is to do the ordering before the grouping -- that doesn't work. For example, the following code:

// Both imports are already in scope in spark-shell; spark.implicits._ provides toDS().
import spark.implicits._
import org.apache.spark.sql.functions.first

case class OrderedKeyValue(key: String, value: String, ordering: Int)
val ds = Seq(
  OrderedKeyValue("a", null, 1),
  OrderedKeyValue("a", null, 2),
  OrderedKeyValue("a", "x", 3),
  OrderedKeyValue("a", "y", 4),
  OrderedKeyValue("a", null, 5)
).toDS()

ds.orderBy("ordering").groupBy("key").agg(first("value", true)).collect()

will sometimes return Array([a,y]) and sometimes Array([a,x]).
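
As an aside (an editorial addition, not part of the original question), the aggregation itself can be made deterministic without a window by packing (ordering, value) into a struct and taking its minimum: Spark compares structs field by field, so the non-null value with the smallest ordering wins no matter how the rows arrive. A minimal sketch, which assumes it is acceptable to drop keys whose values are all null:

import org.apache.spark.sql.functions.{col, min, struct}

// Keep only non-null values, then take the lexicographic minimum of
// (ordering, value) per key, i.e. the earliest non-null value.
val deterministicFirst = ds
  .filter(col("value").isNotNull)
  .groupBy("key")
  .agg(min(struct(col("ordering"), col("value"))).as("first"))
  .select(col("key"), col("first.value").as("value"))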

Answer

Use my beloved windows (...and experience how much simpler your life becomes!)

import org.apache.spark.sql.expressions.Window

// Partition by key, order by ordering, and span the whole partition
// (unbounded frame) so every row sees the same first non-null value.
val byKeyOrderByOrdering = Window
  .partitionBy("key")
  .orderBy("ordering")
  .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

import org.apache.spark.sql.functions.first
val firsts = ds.withColumn("first",
  first("value", ignoreNulls = true) over byKeyOrderByOrdering)

scala> firsts.show
+---+-----+--------+-----+
|key|value|ordering|first|
+---+-----+--------+-----+
|  a| null|       1|    x|
|  a| null|       2|    x|
|  a|    x|       3|    x|
|  a|    y|       4|    x|
|  a| null|       5|    x|
+---+-----+--------+-----+
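
If you only need one row per key rather than the windowed first column repeated on every row, a small follow-up (again an editorial addition) collapses the result:

// "first" is identical within each key, so de-duplicating the
// (key, first) pair yields one row per key.
firsts.select("key", "first").distinct().show()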

NOTE: Somehow, Spark 2.2.0-SNAPSHOT (built today) could not give me the correct answer without the explicit rangeBetween, which I thought should have been the default unbounded frame.
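
For completeness, and because the title asks about secondary sorting with the Dataset API, here is a hedged sketch of the typed equivalent (an editorial addition, not from the original answer): group by key, sort each group in memory, and pick the first non-null value. It assumes no single key has more rows than fit in one executor's memory.

import spark.implicits._

val typedFirsts = ds
  .groupByKey(_.key)
  .mapGroups { (key, rows) =>
    // Sort this group's rows by ordering, then take the first non-null value.
    val firstNonNull = rows.toSeq
      .sortBy(_.ordering)
      .collectFirst { case r if r.value != null => r.value }
    (key, firstNonNull.orNull)
  }
  .toDF("key", "first")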
