Create a new dataset based on a given operation column

Views: 116 Published: 2020/9/4 0:04:21 Tags: apache-spark apache-spark-sql spark-streaming


Problem description

I am using spark-sql 2.3.1 and have the following scenario:

Given a dataset:

// toDF on a Seq needs `import spark.implicits._` (already in scope in spark-shell)
val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2500000"),
  (3, "x3", null, "15.34567"),
  (4, null, "y4", null),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")

That is:

+---+-----+-----+---------+
| id|col_x|col_y|    value|
+---+-----+-----+---------+
|  1|   x1|   y1|0.1992019|
|  2| null|   y2|2.2500000|
|  3|   x3| null| 15.34567|
|  4| null|   y4|     null|
|  5|   x4|   y4|        0|
+---+-----+-----+---------+

Requirement:

I receive an operation column (i.e., operationCol) from outside, and I need to perform some calculation on it.

When performing operations on column "col_x", I need to create a new dataset by filtering out all records in which "col_x" is null, and return that new dataset.

Likewise, when performing operations on column "col_y", I need to create a new dataset by filtering out all records in which "col_y" is null, and return that new dataset.

Example:

val operationCol = "col_x"

// Plain Scala strings are compared with ==, not ===
if (operationCol == "col_x") {
  // filter out all rows where "col_x" is null and return that new dataset
}

if (operationCol == "col_y") {
  // filter out all rows where "col_y" is null and return that new dataset
}

Expected output when operationCol == "col_x":

+---+-----+-----+---------+
| id|col_x|col_y|    value|
+---+-----+-----+---------+
|  1|   x1|   y1|0.1992019|
|  3|   x3| null| 15.34567|
|  5|   x4|   y4|        0|
+---+-----+-----+---------+

Expected output when operationCol == "col_y":

+---+-----+-----+---------+
| id|col_x|col_y|    value|
+---+-----+-----+---------+
|  1|   x1|   y1|0.1992019|
|  2| null|   y2|2.2500000|
|  4| null|   y4|     null|
|  5|   x4|   y4|        0|
+---+-----+-----+---------+

How can I achieve this expected output? In other words, how can a dataframe be branched, and how can a new dataframe/dataset be created in the middle of the flow?
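The page preserves no answer, so what follows is only a minimal sketch of one possible approach, not the original answer. It relies on Spark DataFrames being immutable: `filter` returns a new DataFrame and leaves `ds` untouched, so the "branch" is simply a second value derived from the same source. Because the column name is passed in as a string, no if/else per column is needed. The object name `FilterByOperationCol`, the helper `dropNullsIn`, and the local `SparkSession` settings are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilterByOperationCol {

  // Hypothetical helper: returns a NEW DataFrame that keeps only the rows
  // where the column named by `operationCol` is not null.
  def dropNullsIn(df: DataFrame, operationCol: String): DataFrame =
    df.filter(col(operationCol).isNotNull)

  def main(args: Array[String]): Unit = {
    // Assumed local session, for trying the sketch outside spark-shell.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-by-operation-col")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      (1, "x1", "y1", "0.1992019"),
      (2, null, "y2", "2.2500000"),
      (3, "x3", null, "15.34567"),
      (4, null, "y4", null),
      (5, "x4", "y4", "0")
    ).toDF("id", "col_x", "col_y", "value")

    // Two independent branches derived from the same source DataFrame.
    val withoutNullX = dropNullsIn(ds, "col_x") // keeps ids 1, 3, 5
    val withoutNullY = dropNullsIn(ds, "col_y") // keeps ids 1, 2, 4, 5

    withoutNullX.show()
    withoutNullY.show()
    ds.show() // the original still contains all five rows

    spark.stop()
  }
}
```

An equivalent built-in alternative is `ds.na.drop(Seq(operationCol))`, which drops rows containing null in the listed columns. If `operationCol` comes from an untrusted source, it may be worth checking it against `ds.columns` first, since filtering on an unknown column name fails at analysis time.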
