apache-spark memory consumption for cache() / persist()


Problem description

My Spark cluster hangs when I try to cache() or persist(MEMORY_ONLY_SER()) my RDDs. It works great and computes the results in about 7 minutes if I don't use cache().
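
For reference, this is roughly what that kind of caching call looks like in Scala against the Spark 0.9 API; the master URL, object name and input path below are placeholders, not details from the question:

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object OfflineAnalysisSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://master:7077", "OfflineAnalysis")  // placeholder master URL
    val lines = sc.textFile("s3n://my-bucket/input/*")                   // placeholder input path
    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
    // MEMORY_ONLY_SER stores partitions as serialized byte arrays instead,
    // which is more compact but must still fit entirely in the executor heaps.
    val cached = lines.persist(StorageLevel.MEMORY_ONLY_SER)
    println(cached.count())
  }
}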

I've got 6 c3.xlarge EC2 instances (4 cores and 7.5 GB RAM each), which gives 24 cores and 37.7 GB in total.

I run my application with the following command on the master:

SPARK_MEM=5g MEMORY_FRACTION="0.6" SPARK_HOME="/root/spark" java -cp ./uber-offline.jar:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar pl.instream.dsp.offline.OfflineAnalysis

The data set is about 50 GB of data partitioned into 24 files. I compressed it and stored it in an S3 bucket as 24 files (each of them between 7 MB and 300 MB in size).

I absolutely can't find a reason for this behaviour of my cluster, but it seems like Spark consumed all available memory and got stuck in a GC loop. When I look at the verbose GC output, I find cycles like the ones below:

[GC 5208198K(5208832K), 0,2403780 secs]
[Full GC 5208831K->5208212K(5208832K), 9,8765730 secs]
[Full GC 5208829K->5208238K(5208832K), 9,7567820 secs]
[Full GC 5208829K->5208295K(5208832K), 9,7629460 secs]
[GC 5208301K(5208832K), 0,2403480 secs]
[Full GC 5208831K->5208344K(5208832K), 9,7497710 secs]
[Full GC 5208829K->5208366K(5208832K), 9,7542880 secs]
[Full GC 5208831K->5208415K(5208832K), 9,7574860 secs]

This finally leads to messages like:

WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-xx-xx-xxx-xxx.eu-west-1.compute.internal, 60048, 0) with no recent heart beats: 64828ms exceeds 45000ms

...and stops any progress in computing. This looks like the memory is 100% consumed, but I tried machines with more RAM (30 GB each, for example), and the effect is the same.

What might be the reason for such behaviour? Could anybody help?

Recommended answer

Try using more partitions; you should have 2 - 4 per CPU. In my experience, increasing the number of partitions is often the easiest way to make a program more stable (and often faster).

By default I think your code will use 24 partitions, but for 50 GB of data that is far too few. I'd try a few hundred partitions at least.
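
A minimal sketch of two ways to get more partitions, assuming sc is the job's SparkContext (the path and the count of 200 are illustrative, not values from the question):

// Ask for more input splits when reading (the second argument is the minimum number of splits)...
val lines = sc.textFile("s3n://my-bucket/input/*", 200)
// ...or reshuffle an RDD that already has too few partitions; this costs a shuffle,
// but every downstream stage then works on smaller tasks.
val spread = lines.repartition(200)
println(spread.partitions.length)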

Next, you use SPARK_MEM=5g but say each node has 7.5 GB, so you might as well use SPARK_MEM=7500m.

You could also try increasing the memory fraction, but I think the above is more likely to help.
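
For what it's worth, here is a hedged sketch of how those two settings could be applied through the SparkConf API that ships with Spark 0.9, rather than the environment variables used in the question; the master URL and the concrete values are only examples:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")             // placeholder master URL
  .setAppName("OfflineAnalysis")
  .set("spark.executor.memory", "7500m")        // in place of SPARK_MEM=5g
  .set("spark.storage.memoryFraction", "0.7")   // default is 0.6; raise only if the rest of the job has headroom
val sc = new SparkContext(conf)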

General points: use HDFS for your files, not S3 - it's hugely faster. Make sure you munge your data properly before caching it - e.g. if you have, say, TSV data with 100 columns but only use 10 of the fields, make sure you extract those fields before you try to cache.
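
As an illustration of that last point, reusing sc and the StorageLevel import from the earlier sketch (the path and the column indices are made up for the example):

// Project out only the fields that are actually used before caching,
// so the cached representation holds a handful of columns instead of 100.
val raw = sc.textFile("s3n://my-bucket/input/*")
val slim = raw.map { line =>
  val cols = line.split('\t')
  (cols(0), cols(3), cols(7))   // hypothetical field positions
}
slim.persist(StorageLevel.MEMORY_ONLY_SER)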
