Writing Spark checkpoints to S3 is too slow

Question

I'm using Spark Streaming 1.5.2 and I am ingesting data from Kafka 0.8.2.2 using the Direct Stream approach.

I have enabled checkpointing so that my driver can be restarted and pick up where it left off without losing unprocessed data.

Checkpoints are written to S3 as I'm on Amazon AWS and not running on top of a Hadoop cluster.

The batch interval is 1 second because I want low latency.
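For context, here is a minimal sketch of this setup against the Spark 1.5.2 APIs. The application name, broker list, topic name and bucket path are hypothetical placeholders; the real ones do not appear in the question.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object CheckpointedStream {
  // Hypothetical checkpoint location; the real bucket is elided ("...") in the logs.
  val checkpointDir = "s3a://my-bucket/checkpoints/cxp-filter"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("cxp-filter")
    val ssc = new StreamingContext(conf, Seconds(1))  // 1-second batches, as in the question
    ssc.checkpoint(checkpointDir)                     // checkpoints written to S3 via s3a

    // Direct Stream approach for Kafka 0.8; brokers and topic are hypothetical.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).print()  // stand-in for the real processing
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, recover the context (and unprocessed offsets) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}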

The issue is that it takes anywhere from 1 to 20 seconds to write a single checkpoint to S3. Checkpoints back up in memory and, eventually, the application fails.

2016-04-28 18:26:55,483 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6071 bytes and 1724 ms
2016-04-28 18:26:58,812 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6024 bytes and 3329 ms
2016-04-28 18:27:00,327 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6068 bytes and 1515 ms
2016-04-28 18:27:06,667 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6024 bytes and 6340 ms
2016-04-28 18:27:11,689 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6067 bytes and 5022 ms
2016-04-28 18:27:15,982 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6024 bytes and 4293 ms

Is there a way to increase the interval between checkpoints without increasing the batch interval?

Answer

Yes, you can achieve that with the checkpointInterval parameter. You can set the checkpoint duration explicitly, as described in the Spark Streaming documentation quoted below.

Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
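A minimal sketch of that suggestion, continuing the setup above: keep the 1-second batch interval but checkpoint the stateful stream every 10 seconds (i.e. 5-10 sliding intervals, per the docs). The updateStateByKey running-count stage is a hypothetical stand-in for whatever stateful transformation the job actually runs.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def addStatefulStage(lines: DStream[String]): Unit = {
  // Hypothetical stateful transformation: a running count per line.
  val counts = lines.map(line => (line, 1L)).updateStateByKey(
    (newValues: Seq[Long], state: Option[Long]) => Some(state.getOrElse(0L) + newValues.sum))

  // Override the per-stream checkpoint interval; it must be a multiple
  // of the batch interval, so Seconds(10) = every 10th 1-second batch.
  counts.checkpoint(Seconds(10))
  counts.print()
}

This reduces the number of S3 writes per second without touching the batch interval, at the cost of a longer lineage to recompute if the driver recovers from a checkpoint.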
