How does Spark Structured Streaming handle in-memory state when state data is growing?


Question

In Spark Structured Streaming (version 2.2.0), when using a mapGroupsWithState query with Update as the output mode, Spark seems to store the in-memory state data in a java.util.ConcurrentHashMap. Can someone explain to me in detail what happens when the state data grows and there isn't enough memory anymore? Also, is it possible to change the limit for storing state data in memory, using a Spark config parameter?

Answer

Can someone explain to me in detail what happens when the state data grows and there isn't enough memory anymore?

The executor will crash with an OOM exception. With mapGroupsWithState, you're the one in charge of adding and removing state, so if you overwhelm the JVM with data it can't allocate memory for, the process will crash.
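As a concrete illustration of "adding and removing state yourself", here is a minimal sketch of a mapGroupsWithState query that sets a processing-time timeout and calls state.remove() when a key goes quiet, so stale keys don't accumulate in memory indefinitely. The event/state case classes and the socket source are hypothetical stand-ins, not from the original question:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical event, state, and output types for illustration.
case class Event(userId: String, timestamp: Timestamp)
case class SessionState(count: Long)
case class SessionUpdate(userId: String, count: Long, expired: Boolean)

object SessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("state-sketch").getOrCreate()
    import spark.implicits._

    // Illustrative source; any streaming Dataset[Event] would do.
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()
      .as[String]
      .map(line => Event(line, new Timestamp(System.currentTimeMillis())))

    val updateState =
      (userId: String, batch: Iterator[Event], state: GroupState[SessionState]) =>
        if (state.hasTimedOut) {
          // Timeout fired for an idle key: drop its state so the
          // in-memory map doesn't grow without bound.
          val finalCount = state.get.count
          state.remove()
          SessionUpdate(userId, finalCount, expired = true)
        } else {
          val newCount = state.getOption.map(_.count).getOrElse(0L) + batch.size
          state.update(SessionState(newCount))
          // Re-arm the timeout; keys silent for 30 minutes are evicted above.
          state.setTimeoutDuration("30 minutes")
          SessionUpdate(userId, newCount, expired = false)
        }

    val updates = events
      .groupByKey(_.userId)
      .mapGroupsWithState[SessionState, SessionUpdate](
        GroupStateTimeout.ProcessingTimeTimeout)(updateState)

    updates.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

Without the state.remove() branch, every user key ever seen would stay pinned in executor memory for the lifetime of the query.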

Is it possible to change the limit for storing state data in memory, using a Spark config parameter?

It isn't possible to limit the number of bytes you store in memory. Again, if this is mapGroupsWithState, you have to manage the state in a way that won't cause the JVM to OOM, such as setting timeouts and removing state. If we're talking about stateful aggregations where Spark manages the state for you, such as the agg combinator, then you can bound the state with a watermark, which evicts old data from memory once the time window passes.
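For the Spark-managed case, a watermarked windowed aggregation looks roughly like the following sketch. The built-in rate source (which emits timestamp/value rows) is used here only so the example is self-contained; the threshold values are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, window}

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()
    import spark.implicits._

    // Built-in test source: emits (timestamp, value) rows.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    val counts = events
      // State for windows older than (max event time - 10 minutes)
      // is dropped from memory by Spark itself.
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .agg(count("*").as("cnt"))

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

The watermark is what keeps the aggregation state bounded: once the watermark passes the end of a 5-minute window, that window's state is finalized and evicted, so memory use tracks only the windows still within the 10-minute lateness threshold.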
