Applying a when condition only when the column exists in the dataframe


Problem description

I am using spark-sql 2.4.1 with Java 8. I have a scenario where I need to perform certain operations only if particular columns are present in the given dataframe's column list.

I have a sample dataframe as shown below. Its columns differ depending on the external query executed against the database table.

val data = List(
  ("20", "score", "school", "2018-03-31", 14 , 12 , 20),
  ("21", "score", "school", "2018-03-31", 13 , 13 , 21),
  ("22", "rate", "school", "2018-03-31", 11 , 14, 22),
  ("21", "rate", "school", "2018-03-31", 13 , 12, 23)
 )

val df = data.toDF("id", "code", "entity", "date", "column1", "column2", "column3") // in general ... up to "columnN"

As shown above, the columns of the dataframe built from "data" are not fixed. They vary and may include "column1", "column2", "column3" ... "columnN".

So, depending on which columns are available, I need to perform some operations. For this I am trying to use a "when" clause: when a column is present I have to perform a certain operation on that column, otherwise move on to the next operation.

I have tried the two approaches below using a "when" clause.

First approach:

 Dataset<Row> resultDs =  df.withColumn("column1_avg", 
                     when( df.schema().fieldNames().contains(col("column1")) , avg(col("column1"))))
                     )
 

Second approach:

  Dataset<Row> resultDs =  df.withColumn("column2_sum", 
                     when( df.columns().contains(col("column2")) , sum(col("column1"))))
                     )

Error:

Cannot invoke contains(Column) on the array type String[]

So how can I handle this scenario in Java 8 code?

Recommended answer

You can create a column that holds all the column names. You can then check whether a given column is present and process it if it is available:

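 import org.apache.spark.sql.functions._

 // Store the dataframe's column names in an array column, then test membership in each when()
 df.withColumn("columns_available", array(df.columns.map(lit): _*))
      .withColumn("column1_org",
      when( array_contains(col("columns_available"),"column1") , col("column1")))
      // "column4" is not in the sample data, so "x" is null (when() has no otherwise)
      .withColumn("x",
        when( array_contains(col("columns_available"),"column4") , col("column1")))
      .withColumn("column2_new",
        when( array_contains(col("columns_available"),"column2") , sqrt("column2")))
      .show(false)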
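
Since the question asks for Java 8, the same presence check can also be done in plain Java before any column is added. The compile error in the question comes from two things: a Java array has no contains method, and the value to look up is the column name as a String, not a Column. The sketch below is only an illustration under stated assumptions: the class ColumnGuard and its methods are hypothetical names, and a plain String[] stands in for the result of df.columns() so the example runs without Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark.
class ColumnGuard {
    // True when `name` is among the column names.
    // `columns` stands in for the String[] returned by df.columns().
    static boolean hasColumn(String[] columns, String name) {
        return Arrays.asList(columns).contains(name);
    }

    // Keeps only the planned outputs whose input column exists, in insertion order.
    // Map key = output column name, map value = input column it depends on.
    static List<String> applicableOutputs(String[] columns, Map<String, String> planned) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : planned.entrySet()) {
            if (hasColumn(columns, e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] columns = {"id", "code", "entity", "date", "column1", "column2", "column3"};
        Map<String, String> planned = new LinkedHashMap<>();
        planned.put("column1_avg", "column1");
        planned.put("column4_avg", "column4"); // column4 is absent, so this one is skipped
        planned.put("column2_sum", "column2");
        System.out.println(applicableOutputs(columns, planned)); // prints [column1_avg, column2_sum]
    }
}
```

In a Spark job, the same check would typically wrap the call itself, for example only calling withColumn inside an if (hasColumn(df.columns(), "column1")) block. This matters because an expression that refers to a column missing from the dataframe is rejected when the query is analyzed, even if it sits inside a when whose condition is false.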
