Spark: Convert column of string to an array


Question

How do I convert a column that has been read as a string into a column of arrays? That is, convert from the following schema

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+

to:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns, a few of which I want to specify in this format. Currently I am reading in PySpark as below:

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

Thanks.

Answer

There are various methods. The best way is to use the split function and cast to array<long>:

data.withColumn("b", split(col("b"), ",").cast("array<long>"))

You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf

val tolong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", tolong(data("b"))).show

Hope this helps!
