Convert Array of String column to multiple columns in Spark Scala


Problem description

I have a dataframe with the following schema:

id         : int,
emp_details: Array(String)

Some sample data:

1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)

This data is in a dataframe, and I need to read emp_details from the array and assign the values to new columns as shown below, or alternatively split the array into multiple columns with the column names empname, city and zip:

.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)

Could you please guide me on how to achieve this using Spark 1.6 with Scala?

Really appreciate your help...

Thanks a lot

Recommended answer

You can use withColumn and split to get the required data:
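For context, here is a minimal sketch of the setup the snippet below relies on; the dataframe name df1 and the way it is constructed are assumptions for illustration, not part of the original answer:

import org.apache.spark.sql.functions.{split, udf}

// Spark 1.6: the toDF implicits come from the SQLContext; on Spark 2.x+ use spark.implicits._ instead.
import sqlContext.implicits._

// Build a dataframe matching the question's schema and sample data (assumed construction).
val df1 = Seq(
  (1, Array("empname=xxx", "city=yyy", "zip=12345")),
  (2, Array("empname=bbb", "city=bbb", "zip=22345"))
).toDF("id", "emp_details")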

df1.withColumn("empname", split($"emp_details"(0), "=")(1))
  .withColumn("city", split($"emp_details"(1), "=")(1))
  .withColumn("zip", split($"emp_details"(2), "=")(1))

Output:

+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+
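As a side note (not part of the original answer), if zip should be numeric rather than a string, the same expression can be cast on the fly:

// Variant (assumed): cast the extracted zip value to an integer column.
df1.withColumn("zip", split($"emp_details"(2), "=")(1).cast("int"))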

UPDATE:
If the data in the array does not come in a fixed order, you can use a UDF to convert it to a map and use it as follows:

val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})

Now use the UDF:

df1.withColumn("emp", getColumnsUDF($"emp_details"))
  .select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
  .show(false)

Output:

+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
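A slightly more defensive variant of the same UDF (an assumption, not from the original answer) avoids a NoSuchElementException when one of the keys is missing by falling back to an empty string:

import org.apache.spark.sql.functions.udf

// Defensive variant (assumed): return "" for any key that is absent in emp_details.
val getColumnsSafeUDF = udf((details: Seq[String]) => {
  val detailsMap = details
    .map(_.split("=", 2))                       // split each "key=value" entry once
    .collect { case Array(k, v) => (k, v) }     // keep only well-formed pairs
    .toMap
  (detailsMap.getOrElse("empname", ""), detailsMap.getOrElse("city", ""), detailsMap.getOrElse("zip", ""))
})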

Hope this helps!
