Adding two columns to existing DataFrame using withColumn

Question

I have a DataFrame with a few columns. Now I want to add two more columns to the existing DataFrame.

Currently I am doing this with the DataFrame's withColumn method.

For example:

df.withColumn("newColumn1", udf(col("somecolumn")))
  .withColumn("newColumn2", udf(col("somecolumn")))

Actually, I could return both new column values from a single UDF as an Array[String], but currently this is how I am doing it.

Is there any way to do this more efficiently? Is explode a good option here?

Even with explode, I would have to call withColumn once to return the values as an Array[String], and then use explode to create the two columns.

Which approach is more efficient? Or are there any alternatives?

Recommended answer

AFAIK you need to call withColumn twice (once for each new column). But if your UDF is computationally expensive, you can avoid calling it twice by storing the "complex" result in a temporary column and then "unpacking" it, e.g. using the apply method of Column (which gives access to the array elements). Note that it is sometimes necessary to cache the intermediate result (to prevent the UDF from being called twice per row during unpacking), and sometimes it is not. This seems to depend on how Spark optimizes the plan:

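import org.apache.spark.sql.functions.{col, udf}
// in spark-shell, sc and the toDF implicits are already in scope

// returns both derived values at once: index 0 is upper case, index 1 is lower case
val myUDf = udf((s: String) => Array(s.toUpperCase, s.toLowerCase))

val df = sc.parallelize(Seq("Peter", "John")).toDF("name")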

val newDf = df
  .withColumn("udfResult",myUDf(col("name"))).cache 
  .withColumn("uppercaseColumn", col("udfResult")(0))
  .withColumn("lowercaseColumn", col("udfResult")(1))
  .drop("udfResult")

newDf.show()

This gives:

+-----+---------------+---------------+
| name|uppercaseColumn|lowercaseColumn|
+-----+---------------+---------------+
|Peter|          PETER|          peter|
| John|           JOHN|           john|
+-----+---------------+---------------+
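
Whether the cache is needed can be checked empirically rather than guessed. The following is a minimal sketch, not part of the original answer: it assumes a local SparkSession and uses a hypothetical accumulator named udfCalls to count UDF invocations. It runs the same pipeline without cache. Accumulator updates made inside transformations can be over-counted if tasks are retried, so treat the number as an indication only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("udf-call-count").getOrCreate()
import spark.implicits._

// Hypothetical counter, incremented once per UDF invocation.
val calls = spark.sparkContext.longAccumulator("udfCalls")

val countingUdf = udf((s: String) => {
  calls.add(1)
  Array(s.toUpperCase, s.toLowerCase)
})

val df = Seq("Peter", "John").toDF("name")

// Same unpacking as above, but without caching the intermediate column.
val newDf = df
  .withColumn("udfResult", countingUdf(col("name")))
  .withColumn("uppercaseColumn", col("udfResult")(0))
  .withColumn("lowercaseColumn", col("udfResult")(1))
  .drop("udfResult")

newDf.explain()   // shows how often the UDF appears in the physical plan
newDf.collect()   // forces evaluation so the accumulator is updated
println(s"UDF invocations for ${df.count()} rows: ${calls.value}")
```

If the printed count is larger than the number of rows, the optimizer has duplicated the UDF call, and caching the intermediate result as in the answer's code is one way to avoid that.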

With a UDF returning a tuple, the unpacking would look like this:

val newDf = df
    .withColumn("udfResult",myUDf(col("name"))).cache
    .withColumn("lowercaseColumn", col("udfResult._1"))
    .withColumn("uppercaseColumn", col("udfResult._2"))
    .drop("udfResult")
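
The tuple-returning UDF itself is not shown above. A minimal sketch of what it might look like follows, assuming the tuple order is (lowercase, uppercase) to match the column names in the snippet. The name tupleUdf and the local SparkSession are illustrative, not from the original answer. Spark exposes a Scala tuple returned from a UDF as a struct with fields _1 and _2, which is why col("udfResult._1") works.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("tuple-udf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical tuple-returning UDF: _1 is the lowercase value, _2 the uppercase value.
val tupleUdf = udf((s: String) => (s.toLowerCase, s.toUpperCase))

val df = Seq("Peter", "John").toDF("name")

val newDf = df
  .withColumn("udfResult", tupleUdf(col("name")))
  // Unpack both struct fields in one select instead of two withColumn calls plus drop.
  .select(
    col("name"),
    col("udfResult._1").as("lowercaseColumn"),
    col("udfResult._2").as("uppercaseColumn"))

newDf.show()
```

Using a single select here is only a stylistic variation on the withColumn/drop chain. Whether the UDF is evaluated once or twice per row still depends on the plan, so the caching caveat from the answer applies to this form as well.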
