Spark: shuffle operation leading to long GC pause
Problem description
I'm running Spark 2 and am trying to shuffle around 5 terabytes of JSON. I'm running into very long garbage collection pauses during shuffling of a Dataset:
val operations = spark.read.json(inPath).as[MyClass]
operations.repartition(partitions, operations("id")).write.parquet("s3a://foo")
Are there any obvious configuration tweaks to deal with this issue? My configuration is as follows:
spark.driver.maxResultSize 6G
spark.driver.memory 10G
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory 32G
spark.hadoop.fs.s3a.buffer.dir /raid0/spark
spark.hadoop.fs.s3n.buffer.dir /raid0/spark
spark.hadoop.fs.s3n.multipart.uploads.enabled true
spark.hadoop.parquet.block.size 2147483648
spark.hadoop.parquet.enable.summary-metadata false
spark.local.dir /raid0/spark
spark.memory.fraction 0.8
spark.mesos.coarse true
spark.mesos.constraints priority:1
spark.mesos.executor.memoryOverhead 16000
spark.network.timeout 600
spark.rpc.message.maxSize 1000
spark.speculation false
spark.sql.parquet.mergeSchema false
spark.sql.planner.externalSort true
spark.submit.deployMode client
spark.task.cpus 1
Solution
Adding the following flags got rid of the GC pauses.
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
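For context on why these flags help (this is general G1 tuning guidance, not something stated in the original answer): -XX:InitiatingHeapOccupancyPercent=35 makes G1 begin its concurrent marking cycle when the heap reaches 35% occupancy instead of the default 45%, so reclamation starts before shuffle buffers fill the old generation, and -XX:ConcGCThreads=12 gives the concurrent marking phase more threads so it can finish before the heap fills up. A merged option line that also preserves the heap-dump flag from the original configuration might look like the following (the thread count of 12 is the asker's value and should be matched to the executor's core count):

spark.executor.extraJavaOptions -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12 -XX:+HeapDumpOnOutOfMemoryError

Note that the -XX:MaxPermSize flag from the original configuration is ignored on Java 8 and later, where the permanent generation was replaced by Metaspace.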
I think it does take a fair amount of tweaking, though. This Databricks post was very helpful.