Modify collection inside a Spark RDD foreach

Problem description

I'm trying to add elements to a map while iterating over the elements of an RDD. I'm not getting any errors, but the modifications are not happening.

Everything works fine when I add entries directly or while iterating over other collections:

scala> val myMap = new collection.mutable.HashMap[String,String]
myMap: scala.collection.mutable.HashMap[String,String] = Map()

scala> myMap("test1")="test1"

scala> myMap
res44: scala.collection.mutable.HashMap[String,String] = Map(test1 -> test1)

scala> List("test2", "test3").foreach(w => myMap(w) = w)

scala> myMap
res46: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

But when I try to do the same from an RDD, the map does not change:

scala> val fromFile = sc.textFile("tests.txt")
...
scala> fromFile.take(3)
...
res48: Array[String] = Array(test4, test5, test6)

scala> fromFile.foreach(w => myMap(w) = w)
scala> myMap
res50: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

To make sure the variable is the same one, I tried printing, from inside the foreach, an entry that was already in the map before the foreach. It prints correctly:

fromFile.foreach(w => println(myMap("test1")))
...
test1
test1
test1
...

I've also printed the modified entry of the map inside the foreach, and it prints as modified. But once the operation completes, the map appears unmodified:

scala> fromFile.foreach({w => myMap(w) = w; println(myMap(w))})
...
test4
test5
test6
...
scala> myMap
res55: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

Converting the RDD to an array first (collect) also works fine:

fromFile.collect.foreach(w => myMap(w) = w)
scala> myMap
res89: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test5 -> test5, test1 -> test1, test4 -> test4, test6 -> test6, test3 -> test3)

Is this a context problem? Am I accessing a copy of the data that is being modified somewhere else?

Recommended answer

It becomes clearer when you picture running on a Spark cluster (not a single machine). The RDD is then spread over several machines. When you call foreach, you tell each machine what to do with the piece of the RDD that it holds. If you refer to any local variables (like myMap), they get serialized and sent to those machines so that they can use them. But nothing comes back. So your original copy of myMap is unaffected.
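The copy semantics described above can be imitated without Spark at all. The following is a minimal sketch, assuming plain Java serialization as a stand-in for whatever Spark does when it ships a closure; `shipCopy`, `driverMap`, and `executorMap` are hypothetical names introduced only for this illustration, and Spark's real mechanism has more moving parts.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical helper (not a Spark API): round-trips a value through Java
// serialization, imitating a captured variable being sent to another machine.
def shipCopy[T <: AnyRef](value: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  try in.readObject().asInstanceOf[T] finally in.close()
}

// The map as it exists on the "driver".
val driverMap = mutable.HashMap("test1" -> "test1")

// The deserialized copy that an "executor" would work with.
val executorMap = shipCopy(driverMap)

// Updates land in the copy only; nothing is sent back to the original.
List("test4", "test5", "test6").foreach(w => executorMap(w) = w)

println(executorMap.size) // 4
println(driverMap.size)   // 1
```

This also matches what the question observed: reads and writes inside the foreach appear to work because they act on the copy, while the map on the driver never sees them.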

I think this answers your question, but you are clearly trying to accomplish something, and you will not get there this way. Feel free to explain, here or in a separate question, what you are trying to do, and I will try to help.
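If the goal is simply to end up with the map populated on the driver, one possible approach is to build the key/value pairs on the executors and bring the results back explicitly. This is only a sketch and not necessarily what the answerer had in mind. It assumes a spark-shell session where `sc` is predefined, uses `sc.parallelize` as a stand-in for the `tests.txt` file, and assumes the data is small enough to fit in the driver's memory.

```scala
import scala.collection.mutable

// Assumes spark-shell, where `sc` (the SparkContext) already exists.
val myMap = mutable.HashMap("test1" -> "test1")

// Stand-in for sc.textFile("tests.txt"), so the sketch needs no input file.
val fromFile = sc.parallelize(Seq("test4", "test5", "test6"))

// Build (key, value) pairs on the executors, then return them to the driver.
val pairs = fromFile.map(w => (w, w)).collectAsMap()

// This line runs on the driver, so it updates the original map.
myMap ++= pairs
```

The difference from the foreach version is that the data travels back to the driver through an action's return value rather than through a side effect on a captured variable.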
