Spark FlatMap function for huge lists
Problem description
I have a very basic question. Spark's `flatMap` function allows you to emit 0, 1, or more outputs per input, so the (lambda) function you feed to `flatMap` should return a list. My question is: what happens if this list is too large for your memory to handle?
I haven't implemented this yet; the question should be resolved before I rewrite my MapReduce software, which could easily deal with this by putting `context.write()` anywhere in my algorithm I wanted. (The output of a single mapper can easily be many gigabytes.)

In case you're interested: the mapper does some sort of word count, but in fact it generates all possible substrings, together with a wide range of regular expressions matching the text. (bioinformatics use case)
Solution

> So the (lambda) function you feed to flatMap should return a list.
No, it doesn't have to return a list. In practice, you can easily use a lazy sequence. It is probably easier to spot if you take a look at the Scala `RDD.flatMap` signature:

```scala
flatMap[U](f: (T) ⇒ TraversableOnce[U])
```
Since subclasses of `TraversableOnce` include `SeqView` and `Stream`, you can use a lazy sequence instead of a `List`. For example:

```scala
val rdd = sc.parallelize("foo" :: "bar" :: Nil)

rdd.flatMap { x =>
  (1 to 1000000000).view.map { _ =>
    (x, scala.util.Random.nextLong)
  }
}
```
Since you've mentioned a lambda function, I assume you're using PySpark. The simplest thing you can do is to return a generator instead of a list:

```python
import numpy as np

rdd = sc.parallelize(["foo", "bar"])
rdd.flatMap(lambda x: ((x, np.random.randint(1000)) for _ in xrange(100000000)))
```
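The memory benefit of returning a generator can be seen without Spark at all: a generator produces elements on demand, so a sequence that would be enormous as a list never materializes in memory. A minimal pure-Python illustration (the `emit` helper is hypothetical, for demonstration only):

```python
import sys
from itertools import islice

def emit(x):
    # Generator analogue of a flatMap function: yields pairs lazily
    # instead of building a list of 100,000,000 tuples in memory.
    return ((x, i) for i in range(100000000))

gen = emit("foo")
# The generator object itself stays tiny, regardless of how many
# items it can yield (a few hundred bytes in CPython).
print(sys.getsizeof(gen) < 1000)

# Downstream consumers pull only what they need:
first_three = list(islice(gen, 3))
print(first_three)  # [('foo', 0), ('foo', 1), ('foo', 2)]
```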
Since RDDs are lazily evaluated, it is even possible to return an infinite sequence from `flatMap`. Using a little bit of `toolz` power:

```python
from toolz.itertoolz import iterate

def inc(x):
    return x + 1

rdd.flatMap(lambda x: ((i, x) for i in iterate(inc, 0))).take(1)
```
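What `iterate(inc, 0)` produces, and why `take(1)` terminates against an infinite stream, can be reproduced with the standard library alone. A hedged pure-Python sketch (no Spark; `iterate` here is a minimal re-implementation of `toolz.itertoolz.iterate`):

```python
from itertools import islice

def iterate(f, x):
    # Minimal re-implementation of toolz.itertoolz.iterate:
    # yields x, f(x), f(f(x)), ... forever.
    while True:
        yield x
        x = f(x)

def inc(x):
    return x + 1

# An infinite generator of (i, x) pairs, analogous to the flatMap body:
pairs = ((i, "foo") for i in iterate(inc, 0))

# Taking one element pulls only the first item of the infinite stream,
# which is how the lazy RDD evaluation avoids running forever:
print(list(islice(pairs, 1)))  # [(0, 'foo')]
```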