Spark DataFrames: Combining Two Consecutive Rows


Problem Description

I have a DataFrame with the following structure:

|  id  |  time  |  x  |  y  |
-----------------------------
|  1   |   1    |  0  |  3  |
|  1   |   2    |  3  |  2  |
|  1   |   5    |  6  |  1  |
|  2   |   1    |  3  |  7  |
|  2   |   2    |  1  |  9  |
|  3   |   1    |  7  |  5  |
|  3   |   2    |  9  |  3  |
|  3   |   7    |  2  |  5  |
|  3   |   8    |  4  |  7  |
|  4   |   1    |  7  |  9  |
|  4   |   2    |  9  |  0  |

What I'm trying to achieve is that, for each record, three more columns are created containing the time, x, y of the next record (ordered by time). The catch is that we only take the next record if it has the same id value; otherwise the three new columns should be set to null.

Here is the output I'm trying to get:

|  id  |  time  |  x  |  y  | time+1 | x+1 | y+1 |
--------------------------------------------------
|  1   |   1    |  0  |  3  |   2    |  3  |  2  |
|  1   |   2    |  3  |  2  |   5    |  6  |  1  |
|  1   |   5    |  6  |  1  |  null  | null| null|
|  2   |   1    |  3  |  7  |   2    |  1  |  9  |
|  2   |   2    |  1  |  9  |  null  | null| null|
|  3   |   1    |  7  |  5  |   2    |  9  |  3  |
|  3   |   2    |  9  |  3  |   7    |  2  |  5  |
|  3   |   7    |  2  |  5  |   8    |  4  |  7  |
|  3   |   8    |  4  |  7  |  null  | null| null|
|  4   |   1    |  7  |  9  |   2    |  9  |  0  |
|  4   |   2    |  9  |  0  |  null  | null| null|

Is it possible to achieve this using Spark DataFrames?
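
For reference, a minimal sketch to build this input in the Spark shell (the name dataset matches the snippet in the answer below; spark.implicits._ is assumed to be in scope, as it is in spark-shell):

// Sample data matching the input table above
val dataset = Seq(
  (1, 1, 0, 3), (1, 2, 3, 2), (1, 5, 6, 1),
  (2, 1, 3, 7), (2, 2, 1, 9),
  (3, 1, 7, 5), (3, 2, 9, 3), (3, 7, 2, 5), (3, 8, 4, 7),
  (4, 1, 7, 9), (4, 2, 9, 0)
).toDF("id", "time", "x", "y")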

Answer

You can use the window function lead. First create a window by partitioning on the id column and ordering by time, then in the withColumn call pass the column you want to shift with an offset of 1.

Something like this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead   // lead() comes from the functions object
import spark.implicits._                     // enables the 'col symbol syntax (pre-imported in spark-shell)
val windowSpec = Window.partitionBy('id).orderBy('time)  // look at the next row with the same id, ordered by time
dataset.withColumn("time1", lead('time, 1) over windowSpec).show

You can add the other columns in the same way.
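
For completeness, a sketch that adds all three columns at once (the column names time+1, x+1, y+1 are taken from the desired output above, and the names dataset and result are assumptions for illustration; lead returns null when the current row is the last one for its id, which produces the null rows in the expected result):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

val windowSpec = Window.partitionBy("id").orderBy("time")

// lead(column, 1) pulls the value from the next row within the same id partition,
// or null when the current row is the last one for that id.
val result = dataset
  .withColumn("time+1", lead(col("time"), 1).over(windowSpec))
  .withColumn("x+1",    lead(col("x"), 1).over(windowSpec))
  .withColumn("y+1",    lead(col("y"), 1).over(windowSpec))

result.show()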

