Converting a Spark Dataframe to a mutable Map

Question

I am new to Spark and Scala. I am trying to query a table in Hive (select 2 columns from the table) and convert the resulting DataFrame into a Map. I am using Spark 1.6 with Scala 2.10.6.

For example:

Dataframe:
+--------+-------+
| address| exists|
+--------+-------+
|address1|   1   |
|address2|   0   |
|address3|   1   |
+--------+-------+ 
should be converted to: Map("address1" -> 1, "address2" -> 0, "address3" -> 1)

This is the code I am using:

val testMap: scala.collection.mutable.Map[String, Any] = scala.collection.mutable.Map()
val df = hiveContext.sql("select address, exists from testTable")
df.foreach( r => {
  val key = r(0).toString
  val value = r(1)
  testMap += (key -> value)
  }
)
testMap.foreach(println)
testMap.foreach(println)

When I run the above code, I get this error:

java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

It is throwing this error at the line where I am trying to add the key-value pair to the Map, i.e. testMap += (key -> value).

I know that there is a better and simpler way of doing this using org.apache.spark.sql.functions.map. However, I am using Spark 1.6 and I don't think this function is available. I tried doing the import and I didn't find it in the list of available functions.

Why is my approach giving me an error? And is there a better/more elegant way of achieving this with Spark 1.6?

Any help would be appreciated. Thank you!

Update:

I changed the way the elements are being added to the Map to the following: testMap.put(key, value).

I was previously using += to add the elements. Now I don't get the java.lang.NoSuchMethodError anymore. However, no elements are getting added to the testMap. After the foreach step is complete, I tried to print the size of the map and all the elements in it, and I see that there are zero elements.

Why are the elements not getting added? I am also open to any other better approach. Thank you!!

Answer

This can be broken down into 3 steps, each one already solved on SO:


  1. Convert the DataFrame to an RDD[(String, Int)]
  2. Call collectAsMap() on that RDD to get an immutable map
  3. Convert that map into a mutable one (e.g. as described here: https://stackoverflow.com/a/5050653/5344058)
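The three steps above can be sketched as follows. This is a minimal sketch, not run against a live cluster: the Spark portion (which needs the question's hiveContext, table, and column types) is shown as a comment, and the question's sample rows stand in for what collectAsMap() would return on the driver.

```scala
import scala.collection.mutable

// Steps 1 + 2 on a real cluster (sketch, requires the question's hiveContext):
//   val collected = hiveContext.sql("select address, exists from testTable")
//     .rdd
//     .map(r => (r.getString(0), r.getInt(1)))  // step 1: DataFrame -> RDD[(String, Int)]
//     .collectAsMap()                           // step 2: immutable driver-side Map
//
// Stand-in for the result of collectAsMap(), using the question's sample data:
val collected: scala.collection.Map[String, Int] =
  Map("address1" -> 1, "address2" -> 0, "address3" -> 1)

// Step 3: copy the entries into a mutable Map on the driver
val testMap: mutable.Map[String, Int] = mutable.Map(collected.toSeq: _*)

// Mutation now works as expected: this runs on the driver,
// not inside a serialized executor-side closure.
testMap("address2") = 1
println(testMap)
```

Because collectAsMap() brings the data back to the driver first, the mutable copy is updated locally, which sidesteps the question's foreach problem entirely.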

NOTE: I don't know why you need a mutable map - it's worth noting that using a mutable collection rarely makes much sense in Scala. Sticking with immutable objects only is safer and easier to reason about. "Forgetting" about the existence of mutable collections makes learning functional APIs (like Spark's!) much easier.
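In that immutable spirit, the whole task collapses to a single expression. Again a sketch: the Spark call chain is shown as a comment, with the question's sample rows standing in for the result of collect().

```scala
// With Spark (sketch, reusing the question's hiveContext and table):
//   val addressMap: Map[String, Int] = hiveContext.sql("select address, exists from testTable")
//     .rdd
//     .map(r => (r.getString(0), r.getInt(1)))
//     .collect()
//     .toMap
//
// The same shape, with the question's sample rows standing in for collect():
val rows: Array[(String, Int)] = Array(("address1", 1), ("address2", 0), ("address3", 1))
val addressMap: Map[String, Int] = rows.toMap

println(addressMap("address3"))  // prints 1
```

No step 3 is needed at all: the immutable Map produced by toMap can be looked up directly, which is usually all the original foreach-into-a-map code was trying to achieve.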
