How to get data out of Wrapped Array in Apache Spark / Scala


Problem description

I have a DataFrame with rows that look like this:

[WrappedArray(1, 5DC7F285-052B-4739-8DC3-62827014A4CD, 1, 1425450997, 714909, 1425450997, 714909, {}, 2013, GAVIN, ST LAWRENCE, M, 9)]
[WrappedArray(2, 17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF, 2, 1425450997, 714909, 1425450997, 714909, {}, 2013, LEVI, ST LAWRENCE, M, 9)]
[WrappedArray(3, 53E20DA8-8384-4EC1-A9C4-071EC2ADA701, 3, 1425450997, 714909, 1425450997, 714909, {}, 2013, LOGAN, NEW YORK, M, 44)]
...

Everything before the year (2013 in this example) is nonsense that should be dropped. I would like to map the data to a Name class that I have created and put it into a new DataFrame.

How do I get to the data and do that mapping?

Here is my Name class:

case class Name(year: Int, first_name: String, county: String, sex: String, count: Int)

Basically, I would like to fill my DataFrame with rows and columns according to the schema of the Name class. I know how to do that part; I just don't know how to get at the data in the DataFrame.

Recommended answer

Assuming the data is an array of strings like this:

val df = Seq(Seq("1", "5DC7F285-052B-4739-8DC3-62827014A4CD", "1", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "GAVIN", "STLAWRENCE", "M", "9"),
    Seq("2", "17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF", "2", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LEVI", "ST LAWRENCE", "M", "9"),
    Seq("3", "53E20DA8-8384-4EC1-A9C4-071EC2ADA701", "3", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LOGAN", "NEW YORK", "M", "44"))
  .toDF("array")

You could either use a UDF that returns a case class, or use withColumn multiple times. The latter should be more efficient and can be done like this:

import org.apache.spark.sql.types.IntegerType

// Pull the relevant elements out of the array column by index, casting where needed
val df2 = df.withColumn("year", $"array"(8).cast(IntegerType))
  .withColumn("first_name", $"array"(9))
  .withColumn("county", $"array"(10))
  .withColumn("sex", $"array"(11))
  .withColumn("count", $"array"(12).cast(IntegerType))
  .drop($"array")
  .as[Name]

This will give you a Dataset[Name]:

+----+----------+-----------+---+-----+
|year|first_name|county     |sex|count|
+----+----------+-----------+---+-----+
|2013|GAVIN     |STLAWRENCE |M  |9    |
|2013|LEVI      |ST LAWRENCE|M  |9    |
|2013|LOGAN     |NEW YORK   |M  |44   |
+----+----------+-----------+---+-----+
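For completeness, here is a minimal sketch of the UDF alternative mentioned above. It assumes the same df and Name case class; the names toName and df3 are just illustrative.

import org.apache.spark.sql.functions.udf

// Illustrative UDF: build a Name from the raw string array,
// using the same positions as above (year at index 8, etc.)
val toName = udf((arr: Seq[String]) =>
  Name(arr(8).toInt, arr(9), arr(10), arr(11), arr(12).toInt))

val df3 = df.select(toName($"array").as("name")) // one struct column
  .select("name.*")                              // flatten the struct into top-level columns
  .as[Name]

The UDF deserializes each array into JVM objects, which is why the withColumn version above is expected to be more efficient.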

Hope this helps!

