Create a new dataset based on a given operation column
Question
I am using spark-sql 2.3.1 and have the following scenario:
Given the dataset:
val ds = Seq(
(1, "x1", "y1", "0.1992019"),
(2, null, "y2", "2.2500000"),
(3, "x3", null, "15.34567"),
(4, null, "y4", null),
(5, "x4", "y4", "0")
).toDF("id","col_x", "col_y","value")
which looks like this:
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 2| null| y2|2.2500000|
| 3| x3| null| 15.34567|
| 4| null| y4| null|
| 5| x4| y4| 0|
+---+-----+-----+---------+
Requirement:
I receive an operation column (i.e., operationCol) from outside, on which I need to perform some calculation.
When performing operations on column "col_x", I need to create a new dataset by filtering out all records in which "col_x" is null, and return that new dataset.
Likewise, when performing operations on column "col_y", I need to create a new dataset by filtering out all records in which "col_y" is null, and return that new dataset.
Example:
val operationCol = "col_x"
if (operationCol == "col_x") {
  // filter out all rows where "col_x" is null and return the new dataset
}
if (operationCol == "col_y") {
  // filter out all rows where "col_y" is null and return the new dataset
}
Expected output when operationCol is "col_x":
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 3| x3| null| 15.34567|
| 5| x4| y4| 0|
+---+-----+-----+---------+
Expected output when operationCol is "col_y":
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 2| null| y2|2.2500000|
| 4| null| y4| null|
| 5| x4| y4| 0|
+---+-----+-----+---------+
How can I achieve this expected output? In other words, how can a dataframe be branched, and how do I create a new dataframe/dataset in the middle of the flow?
Recommended answer
You can use df.na.drop() to drop rows that contain nulls (df here stands for the dataset called ds in the question). The drop function can take a list of the columns to check for nulls, so in this case you can write it as follows:
val newDf = df.na.drop(Seq(operationCol))
This creates a new dataframe, newDf, from which every row with a null in operationCol has been removed. The original dataframe is left unchanged.
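To make the answer concrete, here is a minimal sketch that applies na.drop to the dataset from the question. The helper name dropNullsIn, the local SparkSession settings, and the variable names byX and byY are assumptions made for illustration; they do not appear in the original post. In spark-shell the spark session already exists, so the session-creation lines can be skipped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("drop-nulls-sketch")
  .getOrCreate()
import spark.implicits._

// The dataset from the question.
val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2500000"),
  (3, "x3", null, "15.34567"),
  (4, null, "y4", null),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")

// Hypothetical helper: returns a new DataFrame without the rows
// in which `operationCol` is null. The input is not modified.
def dropNullsIn(df: DataFrame, operationCol: String): DataFrame =
  df.na.drop(Seq(operationCol))

val byX = dropNullsIn(ds, "col_x") // keeps ids 1, 3, 5
val byY = dropNullsIn(ds, "col_y") // keeps ids 1, 2, 4, 5

byX.show()
byY.show()
```

Because the column name is passed as a plain string, no if/else branching on operationCol is needed: the same call covers both cases. DataFrames are immutable, so ds can still be used for other branches of the flow after either call.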