How to convert Scala RDD to Map


Question

I have an RDD of strings, org.apache.spark.rdd.RDD[String] = MappedRDD[18], and I want to convert it to a map with unique IDs. I ran 'val vertexMAp = vertices.zipWithUniqueId', but this gave me another RDD, of type 'org.apache.spark.rdd.RDD[(String, Long)]', whereas I want a 'Map[String, Long]'. How can I convert my 'org.apache.spark.rdd.RDD[(String, Long)]' to a 'Map[String, Long]'?

Thanks

Recommended answer

PairRDDFunctions has a built-in collectAsMap function that returns the key-value pairs of the RDD as a map on the driver.

val vertexMap = vertices.zipWithUniqueId().collectAsMap()
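For a self-contained illustration, here is a minimal sketch of the same call. It assumes a local Spark master and a few made-up vertex names standing in for the asker's data. Note that collectAsMap returns a scala.collection.Map, so the sketch adds .toMap to get the immutable Map[String, Long] the question asks for.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master, only so the sketch runs without a cluster.
val conf = new SparkConf().setAppName("rdd-to-map-sketch").setMaster("local[2]")
// getOrCreate reuses an already running context (for example inside spark-shell).
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data standing in for the asker's `vertices` RDD.
val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))

// collectAsMap() brings every pair to the driver as a scala.collection.Map;
// .toMap converts it to an immutable Map[String, Long].
val vertexMap: Map[String, Long] =
  vertices.zipWithUniqueId().collectAsMap().toMap

// The IDs are unique, but they are not guaranteed to be consecutive.
vertexMap.foreach { case (name, id) => println(s"$name -> $id") }
```

This is only reasonable when the whole result comfortably fits in the driver's memory.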

It's important to remember that an RDD is a distributed data structure. You can picture it as 'pieces' of your data spread over the cluster. When you collect, you force all of those pieces to be sent to the driver, and for that to work they must fit in the driver's memory.

From the comments, it looks like you need to deal with a large dataset in your case. Making a Map out of it is not going to work, because it won't fit in the driver's memory; trying will cause an out-of-memory (OOM) error.
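One way to make that risk explicit is to count the entries before collecting. This is a rough sketch, not something the original answer proposes: the limit maxEntriesOnDriver is a hypothetical value you would pick from your own driver's memory budget, and the sample data and local master are likewise assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: a local master, only so the sketch runs without a cluster.
val conf = new SparkConf().setAppName("guarded-collect-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data standing in for the asker's `vertices` RDD.
val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))
val pairs: RDD[(String, Long)] = vertices.zipWithUniqueId()

// Hypothetical limit; choose it from the driver's actual memory budget.
val maxEntriesOnDriver = 1000000L

// count() is computed on the executors and only a single number reaches the driver.
val localMap: Option[Map[String, Long]] =
  if (pairs.count() <= maxEntriesOnDriver) Some(pairs.collectAsMap().toMap)
  else None // too large: keep working with `pairs` as an RDD

println(localMap.map(_.size))
```

An entry count is only a proxy for memory use, since long keys make each entry heavier, so treat the limit as a coarse safety net.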

You probably need to keep the dataset as an RDD. If you are creating a Map in order to look up elements, you could use lookup on a PairRDD instead, like this:

import org.apache.spark.SparkContext._  // import implicit conversions to support PairRDDFunctions

val vertexMap = vertices.zipWithUniqueId()
val vertexYId = vertexMap.lookup("vertexY")  // a Seq[Long] with every value stored under the key
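A slightly fuller sketch of the lookup approach, again assuming a local master and made-up vertex names. Since lookup returns a Seq[Long] rather than a single value, the sketch takes headOption and shows that a missing key yields an empty sequence. Caching the pair RDD is an optional choice made here, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master, only so the sketch runs without a cluster.
val conf = new SparkConf().setAppName("rdd-lookup-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data standing in for the asker's `vertices` RDD.
val vertices = sc.parallelize(Seq("vertexX", "vertexY", "vertexZ"))

// Keep the pairs distributed; cache them because each lookup runs a Spark job.
val vertexIds = vertices.zipWithUniqueId().cache()

// lookup returns all values stored under the key as a Seq[Long].
val ids: Seq[Long] = vertexIds.lookup("vertexY")
val vertexYId: Option[Long] = ids.headOption

// A key that is not present produces an empty sequence, not an exception.
val missing: Seq[Long] = vertexIds.lookup("noSuchVertex")

println(s"vertexY -> $vertexYId, noSuchVertex -> $missing")
```

Each lookup call scans the RDD unless it has a known partitioner, in which case only the partition that the key maps to is searched. This makes lookup a fit for occasional queries rather than for lookups inside a tight loop.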
