Assigning columns to another columns in a Spark Dataframe using Scala


Question

I was looking at this excellent question so as to improve my Scala skills and the answer: Extract a column value and assign it to another column as an array in spark dataframe

I created my modified code as follows which works, but am left with a few questions:

import spark.implicits._   
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
    ("r1", 1, 1),
    ("r2", 6, 4),
    ("r3", 4, 1),
    ("r4", 1, 2)
  )).toDF("ID", "a", "b")

val uniqueVal = df.select("b").distinct().map(x => x.getAs[Int](0)).collect.toList    
def myfun: Int => List[Int] = _ => uniqueVal 
def myfun_udf = udf(myfun)

df.withColumn("X", myfun_udf( col("b") )).show

+---+---+---+---------+
| ID|  a|  b|        X|
+---+---+---+---------+
| r1|  1|  1|[1, 4, 2]|
| r2|  6|  4|[1, 4, 2]|
| r3|  4|  1|[1, 4, 2]|
| r4|  1|  2|[1, 4, 2]|
+---+---+---+---------+

It works, but:

  • I notice that column b is placed in twice.
  • I can also put column a in the second statement and get the same result, e.g.; what is the point of that, then?

df.withColumn("X", myfun_udf( col("a") )).show
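The reason both calls produce the same X column is visible in the definition of `myfun`: it discards its argument (`_ => uniqueVal`), so the UDF returns the pre-collected list for every row no matter which column is passed in. A minimal plain-Scala sketch of that behaviour (no Spark needed; the values mirror the collected `uniqueVal` from the example above):

```scala
// The distinct values of column b, as collected in the question's code.
val uniqueVal = List(1, 4, 2)

// Same shape as myfun in the question: the Int argument is ignored,
// so any input yields the same list.
def myfun: Int => List[Int] = _ => uniqueVal

println(myfun(1))   // List(1, 4, 2)
println(myfun(99))  // List(1, 4, 2) - identical for any argument
```

Since the column argument is never used, the same X column could also be produced in Spark without a UDF at all, e.g. with `org.apache.spark.sql.functions.typedLit(uniqueVal)`, which attaches the literal list to every row directly.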
