Applying withColumn only when the column exists in the dataframe


Problem description

I am using spark-sql 2.4.1 with Java 8. I have a scenario where column names are passed in as a list/Seq, and I need to perform certain operations (sum, avg, percentages, etc.) on those columns only.

Say my data has the columns column1, column2, and column3. On the first run I pass the name column1.

That run selects the "column1" data and performs some operations based on "column1". On the second run I pass the name column2, so column1 is not selected this time and my dataset does not contain "column1". The earlier logic therefore breaks with the error "AnalysisException: cannot resolve 'column1' given input columns".

So I need to check the columns: if a column exists, perform the operations related to that column; otherwise skip those operations.

How can I do this in Spark?

Sample data in the database:

import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

val data = List(
  ("20", "score", "school", "2018-03-31", 14, 12, 20),
  ("21", "score", "school", "2018-03-31", 13, 13, 21),
  ("22", "rate", "school", "2018-03-31", 11, 14, 22),
  ("21", "rate", "school", "2018-03-31", 13, 12, 23)
)
// The selected columns are passed in for each run, so this set keeps changing.
val df = data.toDF("id", "code", "entity", "date", "column1", "column2", "column3")
  .select("id", "code", "entity", "date", "column2")



Dataset<Row> enrichedDs = df
    .withColumn("column1_org", col("column1"))
    .withColumn("column1",
        when(col("column1").isNotNull(),
             functions.callUDF("lookUpData", col("column1").cast(DataTypes.StringType))));

The above logic only works when "column1" is among the selected columns. It fails for the second set because "column1" is not selected. I would like to understand why it only works when "column1" is selected, and I need some logic that handles both cases.

Recommended answer

Not sure if I fully understand your requirement, but are you simply trying to perform some conditional operation depending on which columns are available in your dataframe, which is not known prior to execution?

If so, DataFrame.columns returns the column names, which you can inspect and select from accordingly:

df.columns.foreach { println }
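Since the question uses Java, the same check can be sketched there as well. In the Java API, `df.columns()` returns a `String[]`, so the idea is to keep only the requested names that actually appear in that array and apply `withColumn` for those. The sketch below leaves Spark out so that it runs on its own; the class name `ColumnGuard`, the helper `presentColumns`, and the hard-coded column array are hypothetical and stand in for the real dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnGuard {

    // Returns the requested names that occur in 'available', in request order.
    // With Spark, 'available' would be the array returned by df.columns().
    static List<String> presentColumns(String[] available, List<String> requested) {
        List<String> availableList = Arrays.asList(available);
        List<String> present = new ArrayList<>();
        for (String name : requested) {
            if (availableList.contains(name)) { // exact, case-sensitive match
                present.add(name);
            }
        }
        return present;
    }

    public static void main(String[] args) {
        // Stand-in for df.columns() on the second run from the question.
        String[] available = {"id", "code", "entity", "date", "column2"};
        List<String> requested = Arrays.asList("column1", "column2", "column3");

        for (String name : presentColumns(available, requested)) {
            // With Spark, this is where the per-column logic would go, e.g.
            //   ds = ds.withColumn(name + "_org", col(name));
            System.out.println("would transform: " + name);
        }
    }
}
```

With the columns above, only column2 passes the check, so the operations for column1 and column3 are skipped instead of raising an AnalysisException. Note that this comparison is case-sensitive, while Spark resolves column names case-insensitively by default, so `equalsIgnoreCase` may be a better fit depending on how the names are passed in.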
