Flink taskmanager out of memory and memory configuration


Problem description


We are using Flink streaming to run a few jobs on a single cluster. Our jobs are using rocksDB to hold a state. The cluster is configured to run with a single Jobmanager and 3 Taskmanager on 3 separate VMs. Each TM is configured to run with 14GB of RAM. JM is configured to run with 1GB.


We are experiencing 2 memory-related issues:

- When running the Taskmanager with an 8GB heap allocation, the TM ran out of heap memory and we got a heap out-of-memory exception. Our solution was to increase the heap size to 14GB. This configuration seems to have solved the issue, as we no longer crash due to running out of heap memory.
- Still, after increasing the heap size to 14GB (per TM process), the OS runs out of memory and kills the TM process. RES memory rises over time, reaching ~20GB per TM process.
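For reference, the heap sizes above are set in flink-conf.yaml. A minimal sketch, assuming Flink 1.3.x key names; the values simply mirror the setup described above, they are not a recommendation:

```yaml
# flink-conf.yaml (Flink 1.3.x key names; values mirror the setup above)
jobmanager.heap.mb: 1024     # 1GB JM heap
taskmanager.heap.mb: 14336   # 14GB TM heap -- note that the process RES will
                             # exceed this, due to RocksDB native memory,
                             # metaspace, thread stacks and direct buffers
```

This is why a 14GB heap can still produce a ~20GB resident process: the heap limit bounds only the JVM-managed portion of memory, not the native allocations made outside it.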

1. The question is: how can we predict the maximum total amount of physical memory needed for a given heap size configuration?

2. Given our memory issues, is it reasonable to use non-default values for Flink's managed memory? What would be the guideline in such a case?


Further details: Each VM is configured with 4 CPUs and 24GB of RAM. Flink version: 1.3.2.

Answer


The total amount of required physical and heap memory is quite difficult to compute since it strongly depends on your user code, your job's topology and which state backend you use.


As a rule of thumb, if you experience OOM and are still using the FileSystemStateBackend or the MemoryStateBackend, then you should switch to RocksDBStateBackend, because it can gracefully spill to disk if the state grows too big.
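As a sketch, switching backends is a configuration change. The keys below use Flink 1.3.x naming, and the checkpoint URI is a placeholder for your own DFS path:

```yaml
# flink-conf.yaml -- enable the RocksDB state backend (Flink 1.3.x key names)
state.backend: rocksdb
# where completed checkpoints are persisted (placeholder URI)
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
```

The backend can also be set per job via `env.setStateBackend(...)`, which takes precedence over the cluster-wide setting.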


If you are still experiencing OOM exceptions as described, then you should check whether your user code keeps references to state objects or otherwise generates large objects which cannot be garbage collected. If that is the case, try to refactor your code to rely on Flink's state abstractions, because with RocksDB the state can go out of core.


RocksDB itself needs native memory, which adds to Flink's memory footprint. The amount depends on the block cache size, indexes, bloom filters and memtables. You can find out more about these components and how to configure them in the RocksDB tuning documentation.
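As a sketch of tuning that native memory from the job code, the RocksDB backend ships predefined option profiles that bundle block cache and memtable settings. Class and enum names below are from the flink-statebackend-rocksdb module as of the 1.3 line; the checkpoint URI is a placeholder:

```java
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint URI is a placeholder -- point it at your own DFS.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("hdfs:///flink/checkpoints");

        // Predefined profiles bundle block-cache/memtable settings;
        // SPINNING_DISK_OPTIMIZED keeps native memory usage moderate,
        // SPINNING_DISK_OPTIMIZED_HIGH_MEM trades more memory for speed.
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED);

        env.setStateBackend(backend);
    }
}
```

Finer-grained control (explicit block cache size, write buffer size, etc.) is possible by supplying a custom options factory to the backend, at the cost of having to reason about RocksDB's per-column-family memory accounting yourself.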


Last but not least, you should not activate taskmanager.memory.preallocate when running streaming jobs, because streaming jobs currently don't use managed memory. Thus, by activating preallocation, you would allocate memory for Flink's managed memory, which reduces the available heap space.
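In flink-conf.yaml terms (again assuming 1.3.x key names), that means leaving:

```yaml
# keep preallocation off for streaming jobs (this is the default)
taskmanager.memory.preallocate: false
```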

