Applying withColumn only when a column exists in the dataframe


Question

I am using spark-sql 2.4.1 with Java 8. I have a scenario where the column names are passed to me as a list/Seq, and I need to perform certain operations (sum, avg, percentages, etc.) on those columns only.

In my scenario, let's say I have the columns column1, column2 and column3. On the first run I pass the name column1.

That run selects the "column1" data and performs some operations based on "column1". On the second run I pass the name column2, but column1 is not selected this time, so my dataset does not contain "column1". The earlier logic therefore breaks with the error "AnalysisException: cannot resolve 'column1' given input columns".

Hence I need to check the columns: if a column exists, perform only the operations related to that column; otherwise skip those operations.

How can I do this in Spark?

Sample data, as stored in the database:

// toDF on a local collection needs the session implicits in scope: import spark.implicits._
val data = List(
  ("20", "score", "school", "2018-03-31", 14, 12, 20),
  ("21", "score", "school", "2018-03-31", 13, 13, 21),
  ("22", "rate", "school", "2018-03-31", 11, 14, 22),
  ("21", "rate", "school", "2018-03-31", 13, 12, 23)
)
val df = data.toDF("id", "code", "entity", "date", "column1", "column2", "column3")
  .select("id", "code", "entity", "date", "column2") // these are passed for each run; this set keeps changing



Dataset<Row> enrichedDs = df
    .withColumn("column1_org", col("column1"))
    .withColumn("column1",
        when(col("column1").isNotNull(),
             functions.callUDF("lookUpData", col("column1").cast(DataTypes.StringType))));

The logic above works only when "column1" is among the selected columns. It fails for the second set because "column1" is not selected. I would like to understand why it only works when "column1" is selected, and I need some logic that handles the case where it is not.

Recommended answer

I'm not sure I fully understand your requirement, but are you simply trying to perform some conditional operation depending on which columns are available in your dataframe, something that is not known prior to execution?

If so, Dataframe.columns returns the column names of the dataframe, which you can inspect and then select accordingly:

df.columns.foreach { println }
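
Since the question uses Java, the same check can be sketched there as well. The snippet below is a minimal illustration, not the asker's code: the class ColumnGuard and the helpers hasColumn, presentColumns and applyIfPresent are hypothetical names introduced here. The helpers work on the plain String[] that Dataset.columns() returns in the Java API, so they run without Spark on the classpath. The match is an exact string comparison, which is an assumption: Spark resolves column names case-insensitively by default, so the comparison may need adjusting if the casing of the passed names can differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical helper class; none of these names come from Spark itself.
public class ColumnGuard {

    /** True when `name` is one of `columns` (e.g. the array from df.columns()). Exact match. */
    public static boolean hasColumn(String[] columns, String name) {
        return Arrays.asList(columns).contains(name);
    }

    /** Keeps only the requested names that are actually present, in request order. */
    public static List<String> presentColumns(String[] columns, List<String> requested) {
        List<String> available = Arrays.asList(columns);
        return requested.stream()
                .filter(available::contains)
                .collect(Collectors.toList());
    }

    /** Applies `op` only when the column is present; otherwise returns `ds` unchanged. */
    public static <T> T applyIfPresent(T ds, String[] columns, String name, UnaryOperator<T> op) {
        return hasColumn(columns, name) ? op.apply(ds) : ds;
    }

    public static void main(String[] args) {
        // Columns selected in the second run of the question ("column1" is absent).
        String[] columns = {"id", "code", "entity", "date", "column2"};
        System.out.println(hasColumn(columns, "column1"));
        System.out.println(presentColumns(columns, Arrays.asList("column1", "column2", "column3")));

        // With Spark on the classpath, the question's logic could be guarded like this:
        //   Dataset<Row> enrichedDs = ColumnGuard.applyIfPresent(df, df.columns(), "column1",
        //       d -> d.withColumn("column1_org", col("column1"))
        //             .withColumn("column1", when(col("column1").isNotNull(),
        //                 functions.callUDF("lookUpData", col("column1").cast(DataTypes.StringType)))));
    }
}
```

When "column1" was not selected, applyIfPresent returns the dataset untouched, so the withColumn calls are never built and the AnalysisException does not occur. For the aggregation case (sum, avg and so on), presentColumns can be used to narrow the passed list down to the columns that actually exist before building the expressions.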
