Applying a when condition only when a column exists in the dataframe
Problem description
I am using spark-sql 2.4.1 with Java 8. I have a scenario where I need to perform certain operations only if particular columns are present in the given dataframe's column list.
I have a sample dataframe as shown below. The columns of the dataframe differ depending on the external query executed on the database table.
val data = List(
("20", "score", "school", "2018-03-31", 14 , 12 , 20),
("21", "score", "school", "2018-03-31", 13 , 13 , 21),
("22", "rate", "school", "2018-03-31", 11 , 14, 22),
("21", "rate", "school", "2018-03-31", 13 , 12, 23)
)
val df = data.toDF("id", "code", "entity", "date", "column1", "column2" ,"column3"..."columnN")
As shown above, the dataframe's columns are not fixed; they vary and may include "column1", "column2", "column3" ... "columnN".
So, depending on which columns are available, I need to perform some operations. For this I am trying to use a "when" clause: when a column is present I have to perform a certain operation on that column, otherwise move on to the next operation.
I am trying the two approaches below using the "when" clause:
First way:
Dataset<Row> resultDs = df.withColumn("column1_avg",
    when(df.schema().fieldNames().contains(col("column1")), avg(col("column1"))));
Second way:
Dataset<Row> resultDs = df.withColumn("column2_sum",
    when(df.columns().contains(col("column2")), sum(col("column1"))));
Error:
Cannot invoke contains(Column) on the array type String[]
So how can I handle this scenario using Java 8 code?
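The error message follows from two facts: a Java array such as `String[]` has no `contains` method, and the column names are plain `String` values rather than `Column` objects. A minimal sketch of a presence check in plain Java is shown here. The class `ColumnPresence` and the helper `hasColumn` are hypothetical names, and the hard-coded array only stands in for what `df.columns()` would return.

```java
import java.util.Arrays;

class ColumnPresence {

    // Returns true when `name` is one of the given column names.
    // A String[] has no contains method, so it is viewed as a List first.
    static boolean hasColumn(String[] columns, String name) {
        return Arrays.asList(columns).contains(name);
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by df.columns(); values are illustrative.
        String[] columns = {"id", "code", "entity", "date", "column1", "column2", "column3"};

        System.out.println(hasColumn(columns, "column1")); // true
        System.out.println(hasColumn(columns, "column4")); // false
    }
}
```

The result is an ordinary Java `boolean`, not a `Column`, so it fits an `if` statement placed around a `withColumn` call rather than the condition argument of `when`.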
You can create a column that holds all the column names. Then you can check whether a given column is present and process it only if it is available:
df.withColumn("columns_available", array(df.columns.map(lit): _*))
.withColumn("column1_org",
when( array_contains(col("columns_available"),"column1") , col("column1")))
.withColumn("x",
when( array_contains(col("columns_available"),"column4") , col("column1")))
.withColumn("column2_new",
when( array_contains(col("columns_available"),"column2") , sqrt("column2")))
.show(false)
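One caveat, as far as I can tell: Spark resolves column references while it analyzes the query, so an expression such as `sqrt("column2")` inside `when` would likely still raise an error when `column2` is missing, even if the condition is false. An alternative sketch, not part of the answer above, is to decide in plain Java which derived columns are possible before any `withColumn` call is made. The class `DerivedColumnPlan`, the method `applicable`, and the hard-coded names are hypothetical, and the array stands in for `df.columns()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DerivedColumnPlan {

    // Keeps only the derived columns whose source column is available.
    // Keys are output column names, values are the source columns they need.
    static List<String> applicable(String[] available, Map<String, String> wanted) {
        Set<String> present = new HashSet<>(Arrays.asList(available));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : wanted.entrySet()) {
            if (present.contains(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for df.columns(); values are illustrative.
        String[] available = {"id", "code", "entity", "date", "column1", "column2", "column3"};

        // Insertion order is preserved so operations stay in the order they were declared.
        Map<String, String> wanted = new LinkedHashMap<>();
        wanted.put("column1_org", "column1");
        wanted.put("x", "column4");
        wanted.put("column2_new", "column2");

        System.out.println(applicable(available, wanted)); // [column1_org, column2_new]
    }
}
```

Each name returned could then drive one `withColumn` call, so a missing source column is skipped instead of being referenced.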