Converting sequential code into parallel

Question

I'm trying to understand MapReduce with Spark.
When doing some simple exercises I have no problems doing it the sequential way, but when it comes to parallelising my code, I experience difficulties.

Please see the following example:

var = "Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts."
k = 10

for x in range(len(var) - k + 1):
    print(var[x:x + k])

It splits a text into shingles of 10 characters (w-shingling).
What would be the right way to "convert" it into parallel code using Spark?

How to code a for loop? Does Spark offer loops anyway?

I just need to understand the whole concept.

PS: I've already read the documentation, I know what RDDs are etc. I just don't know how to "convert" sequential code into parallel.

Answer

As you may have read already, Spark has a rich functional API. The RDD operations are classified into transformations and actions, where transformations could be seen as functions that take an RDD and produce an RDD: f(RDD) => RDD and actions as functions that take an RDD and produce some result (Array[T] in case of collect or Unit in case of foreach).
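A loose plain-Python analogy (not Spark's actual API) may help make the transformation/action split concrete: lazy iterators play the role of transformations, and materializing them plays the role of an action.

```python
# Hypothetical analogy only: Python's lazy map/filter stand in for Spark
# transformations, and list() stands in for an action such as collect().

data = range(1, 6)

# "Transformations": build a lazy pipeline; nothing is computed yet.
doubled = map(lambda x: x * 2, data)          # like rdd.map(...)
kept = filter(lambda x: x % 2 == 0, doubled)  # like rdd.filter(...)

# "Action": force evaluation and return a concrete result, like collect().
result = list(kept)
print(result)  # [2, 4, 6, 8, 10]
```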

The general idea on how to port a certain algorithm to Spark is to find a way to express said algorithm using the functional paradigm supported by Spark, by combining transformations and actions to achieve the desired outcome.

Algorithms that depend on a sequence, like the w-shingling above, are challenging to parallelize because there's an implicit dependency on the order of the elements, which is sometimes difficult to express in such a way that it can be operated on across different partitions.
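To make the order-dependency problem concrete, here is a small plain-Python sketch (the two-chunk split below is a hypothetical stand-in for Spark partitions): shingling each chunk independently loses the shingles that straddle the chunk boundary.

```python
# Hypothetical illustration: shingles that cross a "partition" boundary are
# lost when each chunk is shingled independently.

text = "Far far away"
k = 3

# Sequential shingling over the whole text.
seq = [text[i:i + k] for i in range(len(text) - k + 1)]

# Naive "parallel" shingling: split into two chunks, shingle each on its own.
mid = len(text) // 2
parts = [text[:mid], text[mid:]]
naive = [c[i:i + k] for c in parts for i in range(len(c) - k + 1)]

# The shingle spanning the boundary never appears in the naive result.
missing = [s for s in seq if s not in naive]
print(missing)  # ['far']
```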

In this case, I used indexing as a way to preserve the sequence while making it possible to express the algorithm in terms of transformations:

import org.apache.spark.rdd.RDD

def kShingle(rdd: RDD[Char], n: Int): RDD[Seq[Char]] = {
  // 'base' holds characters keyed by index; each iteration shifts the index
  // down by one and joins, appending the next character to each shingle.
  def loop(base: RDD[(Long, Seq[Char])], cumm: RDD[(Long, Seq[Char])], i: Int): RDD[Seq[Char]] = {
    if (i <= 1) cumm.map(_._2)
    else {
      val segment = base.map { case (idx, seq) => (idx - 1, seq) }
      loop(segment, cumm.join(segment).map { case (k, (v1, v2)) => (k, v1 ++ v2) }, i - 1)
    }
  }
  val seqRdd = rdd.map(char => Seq(char))
  val indexed = seqRdd.zipWithIndex.map(_.swap)

  loop(indexed, indexed, n)
}
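For readers more at home in Python, here is a hypothetical plain-Python rendition of the same idea, using a dict keyed by index in place of the indexed RDD and a dict lookup in place of join (no Spark involved):

```python
# Hypothetical single-machine mirror of kShingle: pair each character with
# its index, then repeatedly shift indices and "join" to grow the shingles.

def k_shingle(chars, n):
    indexed = {i: [c] for i, c in enumerate(chars)}  # like zipWithIndex.map(_.swap)
    base = dict(indexed)
    cumm = dict(indexed)
    for _ in range(n - 1):
        # Shift indices down by one, like base.map{case (idx, seq) => (idx - 1, seq)}.
        base = {i - 1: seq for i, seq in base.items()}
        # Inner join on the index, concatenating the joined sequences.
        cumm = {i: cumm[i] + base[i] for i in cumm if i in base}
    return [cumm[i] for i in sorted(cumm)]

print(k_shingle("Floppy Disk", 3)[0])  # ['F', 'l', 'o']
```

Unlike the RDD version, this keeps everything on one machine and in order; the point is only to show the shift-and-join structure.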

Spark shell example:

scala> val rdd = sc.parallelize("Floppy Disk")
scala> kShingle(rdd, 3).collect
res23: Array[Seq[Char]] = Array(List(F, l, o), List(i, s, k), List(l, o, p), List(o, p, p), List(p, p, y), List(p, y,  ), List(y,  , D), List( , D, i), List(D, i, s))
