Convert a DataFrame to a Map (Key-Value) in Spark

Views: 68 | Published: 2020/5/5 13:33:13 | Tags: scala, dictionary, apache-spark

Question

So, I have a DataFrame in Spark which looks like this:

It has 30 columns: only showing some of them!

[ABCD,color,NORMAL,N,2015-02-20,1]
[XYZA,color,NORMAL,N,2015-05-04,1]
[GFFD,color,NORMAL,N,2015-07-03,1]
[NAAS,color,NORMAL,N,2015-08-26,1]
[LOWW,color,NORMAL,N,2015-09-26,1]
[KARA,color,NORMAL,N,2015-11-08,1]
[ALEQ,color,NORMAL,N,2015-12-04,1]
[VDDE,size,NORMAL,N,2015-12-23,1]
[QWER,color,NORMAL,N,2016-01-18,1]
[KDSS,color,NORMAL,Y,2015-08-29,1]
[KSDS,color,NORMAL,Y,2015-08-29,1]
[ADSS,color,NORMAL,Y,2015-08-29,1]
[BDSS,runn,NORMAL,Y,2015-08-29,1]
[EDSS,color,NORMAL,Y,2015-08-29,1]

So, I have to convert this DataFrame into key-value pairs in Scala, building each key from some of the DataFrame's columns and assigning each key a unique value, running from index 0 up to the number of distinct keys.

For example: using the case above, I want to have an output in a map(key-value) collection in Scala like this:

    ([ABCD_color_NORMAL_N_1->0]
    [XYZA_color_NORMAL_N_1->1]
    [GFFD_color_NORMAL_N_1->2]
    [NAAS_color_NORMAL_N_1->3]
    [LOWW_color_NORMAL_N_1->4]
    [KARA_color_NORMAL_N_1->5]
    [ALEQ_color_NORMAL_N_1->6]
    [VDDE_size_NORMAL_N_1->7]
    [QWER_color_NORMAL_N_1->8]
    [KDSS_color_NORMAL_Y_1->9]
    [KSDS_color_NORMAL_Y_1->10]
    [ADSS_color_NORMAL_Y_1->11]
    [BDSS_runn_NORMAL_Y_1->12]
    [EDSS_color_NORMAL_Y_1->13]
    )

I'm new to Scala and Spark, and I tried doing something like this:

    var map: Map[String, Int] = Map()
    var i = 0
    dataframe.foreach(record => {
      // Is there a better way of creating a key?
      val key = record(0) + record(1) + record(2) + record(3)
      var index = i
      map += (key -> index)
      i += 1
    })

But this is not working: the map is still empty after this completes.

Recommended answer

The main issue in your code is trying to modify a variable created on driver-side within code executed on the workers. When using Spark, you can use driver-side variables within RDD transformations only as "read only" values.

Specifically:

The map is created on the driver machine
The map (with its initial, empty value) is serialized and sent to worker nodes
Each node might change the map (locally)
When foreach is done, those local changes are simply thrown away; nothing is sent back to the driver.

To fix this - you should choose a transformation that returns a changed RDD (e.g. map) to create the keys, use zipWithIndex to add the running "ids", and then use collectAsMap to get all the data back to the driver as a Map:

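// Go through the underlying RDD: zipWithIndex and collectAsMap are defined for RDDs
val result: scala.collection.Map[String, Long] = dataframe.rdd
  // Row values are typed Any, so join them with mkString rather than +
  .map(record => record.toSeq.take(4).mkString)
  .zipWithIndex()   // pair each key with a running Long index
  .collectAsMap()   // bring the pairs back to the driver as a Map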

As for the key creation itself - assuming you want to include first 5 columns, and add a separator (_) between them, you can use:

record => record.toSeq.take(5).mkString("_")
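
Putting the pieces together, here is a minimal end-to-end sketch rather than a definitive implementation. It assumes Spark 2.x or later (for SparkSession) and a local master for illustration. The object name DataFrameToMap, the helper buildKeyIndex, and the column names are hypothetical and do not come from the question. The key here skips the date column (index 4), which matches the keys in the desired output; distinct() is added so that repeated keys get a single index.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameToMap {

  // Hypothetical helper: builds "_"-joined keys from the given column
  // positions and assigns each distinct key a unique Long index.
  def buildKeyIndex(df: DataFrame, keyColumns: Seq[Int]): scala.collection.Map[String, Long] =
    df.rdd
      .map(record => keyColumns.map(i => record.get(i)).mkString("_"))
      .distinct()       // one entry per distinct key
      .zipWithIndex()   // indices run from 0 to (number of distinct keys - 1)
      .collectAsMap()   // materialize the result on the driver

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("df-to-map")
      .getOrCreate()
    import spark.implicits._

    // Three sample rows from the question; column names are made up.
    val dataframe = Seq(
      ("ABCD", "color", "NORMAL", "N", "2015-02-20", 1),
      ("XYZA", "color", "NORMAL", "N", "2015-05-04", 1),
      ("VDDE", "size", "NORMAL", "N", "2015-12-23", 1)
    ).toDF("code", "attr", "status", "flag", "date", "count")

    // Columns 0-3 plus column 5, skipping the date at index 4.
    val result = buildKeyIndex(dataframe, Seq(0, 1, 2, 3, 5))
    result.foreach { case (key, id) => println(s"$key -> $id") }

    spark.stop()
  }
}
```

The indices are unique and contiguous, but which key receives which index depends on how the data is partitioned after distinct(), so it should not be relied on to follow the original row order. Note also that collectAsMap pulls every key onto the driver, so this approach only suits key sets that fit in driver memory.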
