What is the efficient way to update value inside Spark's RDD?


Problem Description

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat it as a tree), but in each iteration I only edit a portion of it, namely the sub-tree rooted at a given node and the nodes on the path between that node and the root.

The iterations have a dependency: iteration i+1 needs the result from iteration i, so I need to store the result of each iteration for the next step.

I'm trying to find an efficient way to update an RDD, but have no clue so far. I found that PairRDD has a lookup function which could reduce the computation time from O(N) to O(M), where N denotes the total number of objects in the RDD and M denotes the number of elements in each partition.
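A minimal sketch of that lookup behavior, assuming hypothetical node IDs and values: when the pair RDD has a known partitioner, lookup only scans the single partition the key maps to instead of the whole dataset.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object LookupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lookup-sketch").setMaster("local[*]"))

        // Hypothetical (nodeId, value) pairs. partitionBy attaches a
        // HashPartitioner, so Spark knows which partition holds each key.
        val nodes = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
          .partitionBy(new HashPartitioner(8))
          .cache()

        // With a partitioner set, lookup scans only one partition:
        // roughly O(M) for M elements per partition, not O(N) overall.
        val values: Seq[String] = nodes.lookup(2L) // Seq("b")
        println(values)

        sc.stop()
      }
    }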

So I'm wondering: is there any way to update an object in the RDD in O(M)? Or, more ideally, in O(1)? (I saw an email on Spark's mailing list saying that lookup can be modified to achieve O(1).)

Another thing: if I could achieve O(M) for updating the RDD, could I increase the number of partitions beyond the number of cores I have and achieve better performance?

Recommended Answer

An RDD is a distributed dataset, a partition is the unit of RDD storage, and the unit of RDD processing is an element.

For example, when you read a large file from HDFS as an RDD, the elements of that RDD are Strings (the lines of the file), and Spark stores the RDD across the cluster by partition. As a Spark user, you only need to care about how to process the lines of that file, just as if you were writing a normal program reading a file from the local file system line by line. That's the power of Spark :)
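A minimal sketch of that example (the HDFS path is hypothetical): each element of the RDD is one line, and you write per-element logic without caring how Spark partitions the lines.

    import org.apache.spark.{SparkConf, SparkContext}

    object LinesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lines-sketch").setMaster("local[*]"))

        // Each element of this RDD is a String: one line of the file.
        // Spark decides how the lines are split into partitions.
        val lines = sc.textFile("hdfs:///data/graph.txt") // hypothetical path

        // Per-element logic, as if reading the file line by line.
        val lengths = lines.map(_.length)
        println(lengths.take(5).mkString(", "))

        sc.stop()
      }
    }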

In any case, you have no idea which elements will be stored in a particular partition, so it doesn't make sense to try to update a specific partition.
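To make the element-versus-partition point concrete, here is a minimal sketch (with hypothetical node IDs and a hypothetical toUpdate set): since RDDs are immutable and you address elements rather than partitions, an "update" is expressed as a transformation that yields a new RDD with the changed values.

    import org.apache.spark.{SparkConf, SparkContext}

    object UpdateSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("update-sketch").setMaster("local[*]"))

        val nodes = sc.parallelize(Seq((1L, 10), (2L, 20), (3L, 30)))

        // Hypothetical IDs of the affected elements (e.g. a sub-tree).
        val toUpdate = Set(2L, 3L)

        // Express the update per element; the result is a new RDD,
        // not an in-place write into some partition.
        val updated = nodes.map { case (id, v) =>
          if (toUpdate.contains(id)) (id, v + 1) else (id, v)
        }
        println(updated.collect().mkString(", "))

        sc.stop()
      }
    }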

