Convert a DataFrame to a Map (Key-Value) in Spark
Problem description
I have a DataFrame in Spark that looks like this (it has 30 columns; only some of them are shown):
[ABCD,color,NORMAL,N,2015-02-20,1]
[XYZA,color,NORMAL,N,2015-05-04,1]
[GFFD,color,NORMAL,N,2015-07-03,1]
[NAAS,color,NORMAL,N,2015-08-26,1]
[LOWW,color,NORMAL,N,2015-09-26,1]
[KARA,color,NORMAL,N,2015-11-08,1]
[ALEQ,color,NORMAL,N,2015-12-04,1]
[VDDE,size,NORMAL,N,2015-12-23,1]
[QWER,color,NORMAL,N,2016-01-18,1]
[KDSS,color,NORMAL,Y,2015-08-29,1]
[KSDS,color,NORMAL,Y,2015-08-29,1]
[ADSS,color,NORMAL,Y,2015-08-29,1]
[BDSS,runn,NORMAL,Y,2015-08-29,1]
[EDSS,color,NORMAL,Y,2015-08-29,1]
I need to convert this DataFrame into key-value pairs in Scala, building each key from some of the DataFrame's columns and assigning each key a unique value, numbered from 0 up to (but not including) the number of distinct keys.
For example, for the data above I would like to end up with a Scala Map (key-value) collection like this:
([ABCD_color_NORMAL_N_1->0]
[XYZA_color_NORMAL_N_1->1]
[GFFD_color_NORMAL_N_1->2]
[NAAS_color_NORMAL_N_1->3]
[LOWW_color_NORMAL_N_1->4]
[KARA_color_NORMAL_N_1->5]
[ALEQ_color_NORMAL_N_1->6]
[VDDE_size_NORMAL_N_1->7]
[QWER_color_NORMAL_N_1->8]
[KDSS_color_NORMAL_Y_1->9]
[KSDS_color_NORMAL_Y_1->10]
[ADSS_color_NORMAL_Y_1->11]
[BDSS_runn_NORMAL_Y_1->12]
[EDSS_color_NORMAL_Y_1->13]
)
I'm new to Scala and Spark, and I tried something like this:
var map: Map[String, Int] = Map()
var i = 0
dataframe.foreach(record => {
  // Is there a better way of creating a key?
  val key = record(0) + record(1) + record(2) + record(3)
  var index = i
  map += (key -> index)
  i += 1
})
But this does not work: the Map is still empty after this completes.
Recommended answer
The main issue in your code is that it tries to modify a variable created on the driver side from within code that is executed on the workers. When using Spark, you can use driver-side variables inside RDD transformations only as "read-only" values.
Specifically:
- The map is created on the driver machine.
- The map (with its initial, empty value) is serialized and sent to the worker nodes.
- Each node might change its own local copy of the map.
- When foreach is done, the result is simply thrown away; it is not sent back to the driver.
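The steps above can be imitated without a cluster. The following is only a rough sketch, not Spark itself: `shipToWorker` is a hypothetical helper that round-trips a value through plain Java serialization, standing in for Spark sending a closure's captured state to a worker. Changes made to the deserialized copy never reach the original.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

object ClosureCopyDemo {
  // Hypothetical helper (not a Spark API): serialize a value and read it back,
  // which yields an independent copy, much like the copy a worker receives.
  def shipToWorker[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverMap = mutable.HashMap.empty[String, Int] // lives on the "driver"
    val workerCopy = shipToWorker(driverMap)           // what a "worker" would see
    workerCopy += ("ABCD_color_NORMAL_N_1" -> 0)       // the worker mutates its copy only

    println(s"worker copy size: ${workerCopy.size}")
    println(s"driver map size: ${driverMap.size}")
  }
}
```

The driver-side map stays empty because nothing ever copies the worker's state back, which is the situation the foreach in the question runs into.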
To fix this, choose a transformation that returns a changed RDD (e.g. map) to create the keys, use zipWithIndex to add the running "ids", and then use collectAsMap to get all the data back to the driver as a Map:
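// .rdd is needed on Spark 2.x and later, where DataFrame.map returns a Dataset (which has no zipWithIndex)
val result: scala.collection.Map[String, Long] = dataframe.rdd
  .map(record => record.toSeq.take(4).mkString) // concatenate the first four columns
  .zipWithIndex()                               // pair each key with a running index, starting at 0
  .collectAsMap()                               // bring the (key -> index) pairs back to the driver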
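For reference, here is a fuller sketch of the same approach that can be run in local mode. Several things in it are assumptions rather than part of the original answer: the helper name `buildKeyIndex`, the column names passed to `toDF`, the `local[2]` master, and the `distinct()` step (added because the question asks for one index per distinct key). It goes through `.rdd`, which is needed on Spark 2.x and later.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object KeyIndexSketch {
  // Hypothetical helper: joins the chosen column positions with "_",
  // de-duplicates the keys, and numbers them starting at 0.
  def buildKeyIndex(df: DataFrame, keyColumns: Seq[Int]): scala.collection.Map[String, Long] =
    df.rdd
      .map((row: Row) => keyColumns.map(i => row.get(i)).mkString("_"))
      .distinct()      // one entry per distinct key
      .zipWithIndex()  // RDD[(String, Long)], indices 0 until the number of keys
      .collectAsMap()  // collected on the driver

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("key-index-sketch")
      .getOrCreate()
    import spark.implicits._

    // A few rows from the question; the column names are made up.
    val df = Seq(
      ("ABCD", "color", "NORMAL", "N", "2015-02-20", 1),
      ("XYZA", "color", "NORMAL", "N", "2015-05-04", 1),
      ("VDDE", "size", "NORMAL", "N", "2015-12-23", 1),
      ("BDSS", "runn", "NORMAL", "Y", "2015-08-29", 1)
    ).toDF("id", "attr", "kind", "flag", "date", "n")

    // Columns 0-3 and 5, skipping the date, as in the question's desired output.
    val result = buildKeyIndex(df, Seq(0, 1, 2, 3, 5))
    result.foreach { case (key, index) => println(s"$key -> $index") }

    spark.stop()
  }
}
```

Note that the order in which indices are assigned after `distinct()` is not guaranteed to follow the original row order, and `collectAsMap` is only suitable when the set of keys fits comfortably in driver memory.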
As for the key creation itself: assuming you want to include the first 5 columns and add a separator (_) between them, you can use:
record => record.toSeq.take(5).mkString("_")
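One detail worth checking against the sample data: the first five columns include the date, while the keys shown in the question skip it and end with the sixth column. The sketch below uses a plain `Seq[Any]` as a stand-in for `record.toSeq` (an assumption made so it runs without Spark) to compare the two choices.

```scala
object KeyFormats {
  // Stand-in for `record.toSeq` on the first row of the sample data.
  val fields: Seq[Any] = Seq("ABCD", "color", "NORMAL", "N", "2015-02-20", 1)

  // First five columns: the date becomes part of the key.
  val firstFive: String = fields.take(5).mkString("_")

  // Columns 0-3 and 5: the date is skipped, matching the keys in the question.
  val withoutDate: String = Seq(0, 1, 2, 3, 5).map(i => fields(i)).mkString("_")

  def main(args: Array[String]): Unit = {
    println(firstFive)   // ABCD_color_NORMAL_N_2015-02-20
    println(withoutDate) // ABCD_color_NORMAL_N_1
  }
}
```

Inside the Spark job, the second form would correspond to something like `record => Seq(0, 1, 2, 3, 5).map(i => record.get(i)).mkString("_")`.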