run reduceByKey on huge data in spark


Problem description

I'm running reduceByKey in Spark. My program is the simplest Spark example:

val counts = textFile.flatMap(line => line.split(" "))
                     .repartition(20000)
                     .map(word => (word, 1))
                     .reduceByKey(_ + _, 10000)
counts.saveAsTextFile("hdfs://...")

but it always runs out of memory...

I'm using 50 servers, 35 executors per server, and 140 GB of memory per server.

The data volume is: 8 TB of documents, 20 billion documents, and 1000 billion words in total. The number of distinct words after the reduce will be about 100 million.

I wonder how to set the Spark configuration.

I wonder what values these parameters should be:

1. The number of map tasks? 20000, as in the example?
2. The number of reduce tasks? 10000, as in the example?
3. Any other parameters?

Solution

It would be helpful if you posted the logs, but one option would be to specify a larger number of partitions when reading in the initial text file (e.g. sc.textFile(path, 200000)) rather than repartitioning after reading. Another important thing is to make sure that your input file is splittable (some compression options make it non-splittable, in which case Spark may have to read it on a single machine, causing OOMs).
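As a minimal sketch of that suggestion (the app name is a placeholder, the partition counts are assumptions to tune, and the HDFS paths are left elided as in the question), reading with a high minPartitions value up front replaces the repartition step:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("WordCount")  // placeholder app name
    val sc = new SparkContext(conf)

    // Ask for ~200000 input splits directly instead of calling repartition(20000)
    // afterwards, which would shuffle the raw 8 TB of text.
    val textFile = sc.textFile("hdfs://...", 200000)

    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _, 10000)

    counts.saveAsTextFile("hdfs://...")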

Some other options: since you aren't caching any of the data, you could reduce the amount of memory Spark sets aside for caching (controlled with spark.storage.memoryFraction), and since you are only working with tuples of strings I'd recommend using the org.apache.spark.serializer.KryoSerializer serializer.
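For reference, a rough sketch of how those two settings could be applied on the SparkConf (the 0.1 fraction is an assumed starting point, not a value given in the answer):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("WordCount")  // placeholder app name
      // Shrink the cache region, since nothing is cached; 0.1 is an assumption to tune.
      .set("spark.storage.memoryFraction", "0.1")
      // Use Kryo for more compact, faster serialization of the (String, Int) tuples.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")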
