Spark FlatMap function for huge lists


Question


I have a very basic question. Spark's flatMap function allows you to emit 0, 1, or more outputs per input, so the (lambda) function you feed to flatMap should return a list.

My question is: what happens if this list is too large for memory to handle?

I haven't implemented this yet; the question should be resolved before I rewrite my MapReduce software, which could easily deal with this by putting context.write() anywhere in my algorithm I wanted. (The output of a single mapper can easily be many gigabytes.)

In case you're interested: the mapper does some sort of word count, but in fact it generates all possible substrings, together with a wide range of regular expressions matching the text. (Bioinformatics use case.)

Solution

So the (lambda) function you feed to flatMap should return a list.

No, it doesn't have to return a list. In practice you can easily use a lazy sequence instead. It is probably easier to see if you take a look at the Scala RDD.flatMap signature:

flatMap[U](f: (T) ⇒ TraversableOnce[U])

Since subclasses of TraversableOnce include SeqView and Stream, you can use a lazy sequence instead of a List. For example:

val rdd = sc.parallelize("foo" :: "bar" :: Nil)
rdd.flatMap {x => (1 to 1000000000).view.map {
    _ => (x, scala.util.Random.nextLong)
}}

Since you've mentioned a lambda function, I assume you're using PySpark. The simplest thing you can do is return a generator instead of a list:

import numpy as np

rdd = sc.parallelize(["foo", "bar"])
# The generator expression is evaluated lazily, so the 100 million
# elements are never materialized at once (use range in Python 3).
rdd.flatMap(lambda x: ((x, np.random.randint(1000)) for _ in xrange(100000000)))
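The same principle can be sketched outside Spark in plain Python (a hypothetical stand-alone example, not part of the original answer): a generator yields elements one at a time, so an arbitrarily long sequence never has to fit in memory. Here itertools.islice stands in for Spark pulling only as many elements as it needs.

```python
from itertools import islice

def emit_pairs(x, n):
    # Generator: yields (x, i) pairs lazily, one at a time,
    # so n can be huge without ever materializing a list.
    return ((x, i) for i in range(n))

# Pull just the first 3 of 100,000,000 potential pairs;
# the remaining elements are never produced.
first_three = list(islice(emit_pairs("foo", 100_000_000), 3))
```

Had emit_pairs built a list instead, this call would need tens of gigabytes; with a generator it is effectively free.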

Since RDDs are lazily evaluated, it is even possible to return an infinite sequence from flatMap. Using a little bit of toolz power:

from toolz.itertoolz import iterate
def inc(x):
    return x + 1

rdd.flatMap(lambda x: ((i, x) for i in iterate(inc, 0))).take(1)
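If toolz is not available, itertools.count from the standard library gives the same infinite iterator. Below is a sketch of the equivalent logic without Spark, where islice plays the role of .take(1):

```python
from itertools import count, islice

def pairs(x):
    # Infinite lazy stream of (i, x) pairs; count(0) is the
    # stdlib equivalent of toolz's iterate(inc, 0).
    return ((i, x) for i in count(0))

# Like .take(1): consume only the first element of the infinite stream.
first = list(islice(pairs("foo"), 1))
```

This works for the same reason the Spark version does: nothing downstream ever asks for more than one element, so the infinite stream is never fully evaluated.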
