Spark Structured Streaming Checkpoint Cleanup

Problem Description

I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and as far as I can tell it works correctly, but I don't understand what will happen in a couple of situations. If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up? And does it matter if they are never cleaned up? It seems that eventually they would become large enough that the program would take a long time to parse them.
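For context, a minimal sketch of this kind of setup; the format, schema, and all paths are illustrative placeholders, not part of the original question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("file-ingest")
      .getOrCreate()

    // File source: requires an explicit schema; Spark records which
    // files it has already processed in the checkpoint directory.
    val input = spark.readStream
      .format("json")                    // placeholder format
      .schema("id LONG, payload STRING") // placeholder schema
      .load("/data/incoming")            // placeholder input path

    val query = input.writeStream
      .format("parquet")
      .option("path", "/data/output")                       // placeholder sink path
      .option("checkpointLocation", "/data/checkpoints/q1") // placeholder checkpoint path
      .start()

    query.awaitTermination()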

My other question is: when I manually remove or alter the checkpoint folder, or switch to a different checkpoint folder, no new files are ingested. The files are recognized and added to the checkpoint, but they are not actually ingested. This has me worried that if the checkpoint folder is somehow altered, my ingestion will break. I haven't been able to find much information on the correct procedure in these situations.

Answer

"If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up?"

Structured Streaming keeps a background thread which is responsible for deleting snapshots and deltas of your state, so you shouldn't be concerned about it unless your state is really large and the amount of space you have is small, in which case you can configure the number of retained deltas/snapshots Spark stores.
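If you do need to tune this, the retention knobs live in the SQL configuration. A minimal sketch, assuming you set them when building the session; the values shown are illustrative, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("checkpoint-retention")
      // Minimum number of completed batches whose checkpoint metadata
      // must be retained for recovery (default 100).
      .config("spark.sql.streaming.minBatchesToRetain", "50")
      // How many state-store delta files accumulate before the background
      // maintenance thread consolidates them into a snapshot (default 10).
      .config("spark.sql.streaming.stateStore.minDeltasForSnapshot", "5")
      // How often that background maintenance task runs (default 60s).
      .config("spark.sql.streaming.stateStore.maintenanceInterval", "30s")
      .getOrCreate()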

"when I manually remove or alter the checkpoint folder, or switch to a different checkpoint folder, no new files are ingested."

I'm not really sure what you mean here, but you should only remove checkpointed data in special cases. Structured Streaming allows you to keep state across version upgrades as long as the stored data types are backwards compatible. I don't see a good reason for changing the location of your checkpoint or deleting the files manually unless something bad has happened.
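For reference, a checkpoint directory for a query like the sketch above typically contains something like the following (the path continues the hypothetical example; state/ appears only for stateful queries):

    /data/checkpoints/q1/
      metadata    <- query id, written once when the query first starts
      offsets/    <- one file per batch: the input planned for that batch
      commits/    <- one file per batch that completed successfully
      sources/    <- for file sources: the log of files already processed
      state/      <- snapshots and deltas kept by stateful operators

Since the sources/ log is what marks input files as already seen, editing the checkpoint or pointing the query at a different one changes which files Spark treats as new, which is consistent with the behavior described in the question.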
