Converting sequential code into parallel
Problem Description
I'm trying to understand MapReduce with Spark.
When doing some simple exercises, I have no problems doing it the sequential way, but when it comes to parallelising my code, I experience difficulties.
Consider the following example:
var = "Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts."
k = 10
# stop at len(var) - k + 1 so every shingle has the full k characters
for x in range(0, len(var) - k + 1):
    print(var[x:x + k])
It splits a text into shingles of 10 characters (w-shingling).
What would be the right way to "convert" it into parallel code using Spark?
How do I code a for loop? Does Spark offer loops at all?
I just need to understand the whole concept.
PS: I've already read the documentation and I know what RDDs are etc. I just don't know how to "convert" sequential code into parallel.
Answer
As you may have read already, Spark has a rich functional API. RDD operations are classified into transformations and actions, where transformations can be seen as functions that take an RDD and produce an RDD (f(RDD) => RDD), and actions as functions that take an RDD and produce some result (Array[T] in the case of collect, or Unit in the case of foreach).
The general idea of how to port an algorithm to Spark is to find a way to express that algorithm using the functional paradigm Spark supports, combining transformations and actions to achieve the desired outcome.
Algorithms that depend on a sequence, like the w-shingling above, are challenging to parallelize because there is an implicit dependency on the order of the elements that is sometimes difficult to express in such a way that it can be operated on in different partitions.
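To see why the order dependency matters, here is a small single-machine sketch (plain Python, with a hypothetical two-way split standing in for partitions): shingling each chunk separately loses the shingles that straddle the boundary.

```python
# Shingling each "partition" independently drops the boundary shingles.

def shingles(text, n):
    # zip the string with itself shifted by 1..n-1 positions
    return ["".join(t) for t in zip(*(text[i:] for i in range(n)))]

text = "Floppy Disk"
whole = shingles(text, 3)

left, right = text[:6], text[6:]            # pretend these are two partitions
partitioned = shingles(left, 3) + shingles(right, 3)

missing = [s for s in whole if s not in partitioned]
print(missing)                              # ['py ', 'y D'] straddle the split
```

This is why the answer below keeps a global index per character, so elements in different partitions can still be matched with their neighbours.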
In this case, I used indexing as a way to preserve the sequence while making it possible to express the algorithm in terms of transformations:
import org.apache.spark.rdd.RDD

def kShingle(rdd: RDD[Char], n: Int): RDD[Seq[Char]] = {
  // Grow the shingles one character at a time until they have length n
  def loop(base: RDD[(Long, Seq[Char])], cumm: RDD[(Long, Seq[Char])], i: Int): RDD[Seq[Char]] = {
    if (i <= 1) cumm.map(_._2) else {
      // Shift indices left by one so the join pairs each position with its successor
      val segment = base.map { case (idx, seq) => (idx - 1, seq) }
      loop(segment, cumm.join(segment).map { case (k, (v1, v2)) => (k, v1 ++ v2) }, i - 1)
    }
  }
  val seqRdd = rdd.map(char => Seq(char))
  // Pair each one-character Seq with its global position: (index, Seq(char))
  val indexed = seqRdd.zipWithIndex.map(_.swap)
  loop(indexed, indexed, n)
}
Spark shell example:
scala> val rdd = sc.parallelize("Floppy Disk")
scala> kShingle(rdd,3).collect
res23: Array[Seq[Char]] = Array(List(F, l, o), List(i, s, k), List(l, o, p), List(o, p, p), List(p, p, y), List(p, y, ), List(y, , D), List( , D, i), List(D, i, s))
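As a quick single-machine cross-check of what kShingle computes, the same shingles can be produced in plain Python (note that the collect output above is in partition order rather than positional order, so only the set of shingles matches):

```python
# Plain-Python w-shingling for comparison with the Spark result above.

def k_shingle(text, n):
    # zip the string with itself shifted by 1..n-1 positions;
    # each resulting tuple is one shingle of length n
    return ["".join(t) for t in zip(*(text[i:] for i in range(n)))]

print(k_shingle("Floppy Disk", 3))
# ['Flo', 'lop', 'opp', 'ppy', 'py ', 'y D', ' Di', 'Dis', 'isk']
```

Both versions yield len(text) - n + 1 = 9 shingles for "Floppy Disk" with n = 3.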