Apache Spark Shared Counter


Problem description


I'm fairly new to Apache Spark and I'm using GraphX. So I have to use Scala, to which I'm also new ;-).

Update:


I have a graph, let's say like in the following picture:


Every node has its own HashMap or List where it can store IDs. Now I'm iterating over the triplets of the graph, and if the edge attribute matches a criterion (which is ignored in this example), then I want to store the same ID in the start and end node of this edge.


After one round of this algorithm the result could look like this:


And here is the code (shortened):

val newNodes = graph.triplets.flatMap(triplet => {
    val newId = Counter.getResultID()
    // Emit the new id for both endpoints of the edge.
    // Outputs sth. like
    //  (1, 1)   (3, 1)   (2, 2)    (3, 2)
    List((triplet.srcId, newId), (triplet.dstId, newId))
})


I get the unique ID from the counter object:

object Counter {
    private var resultCount: Int = 0

    def getResultID(): Int = {
        resultCount += 1
        resultCount
    }
}
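As far as the synchronization part of the question goes, a thread-safe variant of this counter can be sketched with `java.util.concurrent.atomic.AtomicInteger` (the name `SafeCounter` is hypothetical, not from the original code). Note the caveat: this only makes the counter safe across threads within one JVM; each Spark executor JVM would still end up with its own copy of the singleton, so the IDs would not be globally unique across the cluster.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical thread-safe counter: safe across threads in a single JVM,
// but each Spark executor JVM would still hold its own independent copy,
// so this does not produce cluster-wide unique ids.
object SafeCounter {
  private val resultCount = new AtomicInteger(0)

  // incrementAndGet atomically increments and returns the new value.
  def getResultID(): Int = resultCount.incrementAndGet()
}
```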


After the flatMap I'm grouping all tuples by the node id and then putting all IDs for one node into a list (with a map operator). So the outcome for node 3 is (3, List(1, 2)). This result is then stored back into the graph with an outerJoin.
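The grouping step can be sketched with plain Scala collections (a local simulation of the RDD groupBy/map operators, using the sample tuples produced by the flatMap above):

```scala
// The (nodeId, newId) tuples emitted by the flatMap over the triplets.
val newNodes = List((1L, 1), (3L, 1), (2L, 2), (3L, 2))

// Group by node id and collect all ids for each node into a list,
// mirroring the groupBy + map step on the RDD.
val grouped: Map[Long, List[Int]] =
  newNodes.groupBy(_._1).map { case (node, pairs) => (node, pairs.map(_._2)) }

// For node 3 this yields (3, List(1, 2)), as described above.
```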


So my question is: do I have to ensure that the IDs are unique by synchronizing the method, or is it okay this way? If somebody has another idea for solving the whole problem without explicitly assigning IDs, for example with the zip method, that would also be nice :-).


Besides this question, can someone explain to me what happens to the Counter object at runtime? Because it's a singleton, does it reside where the driver is executed (on the master?)? I read somewhere that variables used in normal code and then referenced during parallel computations with Spark are copied to the workers/threads, which shouldn't happen here.

Thanks in advance!

Recommended answer


There is no need to implement this yourself; assigning unique IDs is so common that Spark has it built in already as def zipWithUniqueId(): RDD[(T, Long)]. As you can see, it assigns a unique Long to each value, meaning it returns an RDD of tuples. Example usage:

val uniqIds = vertexData.zipWithUniqueId().map { case (k, v) => (v, k) } // I'm assuming you want the unique ids as the vertexId


You could also do this with the edge attributes.
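A plain-Scala analogue using zipWithIndex shows the shape of the result (Spark's zipWithUniqueId distributes id generation across partitions, so the actual Long values differ, but every element still gets a unique id):

```scala
// Sample vertex data standing in for the RDD's contents.
val vertexData = List("a", "b", "c")

// Pair each element with a unique index, then swap so the id comes first,
// mirroring the RDD example above.
val uniqIds = vertexData.zipWithIndex.map { case (k, v) => (v.toLong, k) }

// uniqIds == List((0L, "a"), (1L, "b"), (2L, "c"))
```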

