Flink taskmanager out of memory and memory configuration


Problem description

We are using Flink streaming to run a few jobs on a single cluster. Our jobs use RocksDB to hold state. The cluster is configured to run with a single JobManager and 3 TaskManagers on 3 separate VMs. Each TM is configured to run with 14GB of RAM; the JM is configured to run with 1GB.

We are experiencing 2 memory-related issues:

- When running the TaskManager with an 8GB heap allocation, the TM ran out of heap memory and we got a heap out-of-memory exception. Our solution to this problem was increasing the heap size to 14GB. This configuration seems to have solved the issue, as we no longer crash due to running out of heap memory.
- Still, after increasing the heap size to 14GB (per TM process), the OS runs out of memory and kills the TM process. RES memory rises over time, reaching ~20GB per TM process.

1. The question is how can we predict the maximum total amount of physical memory needed and the appropriate heap size configuration?

2. Given our memory issues, is it reasonable to use non-default values for Flink's managed memory? What would the guideline be in such a case?

Further details: Each VM is configured with 4 CPUs and 24GB of RAM. Using Flink version: 1.3.2.

Recommended answer

The total amount of required physical and heap memory is quite difficult to compute since it strongly depends on your user code, your job's topology and which state backend you use.
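
As a very rough starting point (this is a back-of-the-envelope sketch, not a formula from the Flink documentation), you can think of the resident size as heap plus JVM overhead plus the native memory of all RocksDB instances. Every figure in the following snippet is an assumption to be replaced with your own measurements:

```java
// Hypothetical sizing sketch; all component sizes below are assumptions.
public class MemoryEstimate {
    public static void main(String[] args) {
        long heapGb = 14;                 // TaskManager heap (-Xmx), as in the question
        long jvmOverheadGb = 1;           // metaspace, thread stacks, GC structures (assumed)
        int rocksDbInstances = 4;         // roughly: slots * stateful operators per slot (assumed)
        double perInstanceNativeGb = 1.5; // block cache + memtables + indexes (assumed)

        double totalGb = heapGb + jvmOverheadGb + rocksDbInstances * perInstanceNativeGb;
        System.out.printf("Estimated resident set: ~%.1f GB%n", totalGb);
        // With these assumptions: 14 + 1 + 4 * 1.5 = 21 GB, which illustrates how a
        // 14GB heap can grow to the ~20GB RES observed in the question.
    }
}
```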

As a rule of thumb, if you experience OOM and are still using the FileSystemStateBackend or the MemoryStateBackend, then you should switch to RocksDBStateBackend, because it can gracefully spill to disk if the state grows too big.
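
For illustration, a minimal sketch of switching a job to the RocksDB backend (this assumes the flink-statebackend-rocksdb dependency is on the classpath; the checkpoint URI is a placeholder):

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDBBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // State lives in local RocksDB instances and can spill to disk;
        // checkpoints are written to the given (placeholder) URI.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        // ... define and execute the streaming job here ...
    }
}
```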

If you are still experiencing the OOM exceptions you have described, then you should check whether your user code keeps references to state objects or otherwise generates large objects that cannot be garbage collected. If that is the case, then you should try to refactor your code to rely on Flink's state abstraction, because with RocksDB the state can go out of core.
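
As one hypothetical example of what relying on Flink's state abstraction can look like, the running aggregate below lives in a ValueState managed by the state backend rather than in an ordinary member field on the heap (the function must run on a keyed stream, e.g. after a keyBy, for keyed state to be available):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeping the per-key sum in keyed state lets the RocksDB backend
// move it out of core instead of pinning it on the heap.
public class RunningSum extends RichFlatMapFunction<Long, Long> {
    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
            new ValueStateDescriptor<>("sum", Long.class));
    }

    @Override
    public void flatMap(Long value, Collector<Long> out) throws Exception {
        Long current = sum.value(); // null on first access for a key
        long updated = (current == null ? 0L : current) + value;
        sum.update(updated);
        out.collect(updated);
    }
}
```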

RocksDB itself needs native memory which adds to Flink's memory footprint. This depends on the block cache size, indexes, bloom filters and memtables. You can find out more about these things and how to configure them here.
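
Since the original link did not survive the page conversion, here is a hedged sketch of where those knobs sit in Flink 1.3, via an OptionsFactory on the RocksDB backend; the 64MB/256MB figures are placeholders for illustration, not tuning recommendations:

```java
import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class RocksDBTuningExample {
    public static RocksDBStateBackend tunedBackend() throws java.io.IOException {
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
        backend.setOptions(new OptionsFactory() {
            @Override
            public DBOptions createDBOptions(DBOptions currentOptions) {
                return currentOptions; // no DB-level changes in this sketch
            }

            @Override
            public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
                return currentOptions
                    .setWriteBufferSize(64 * 1024 * 1024) // memtable size (assumed 64MB)
                    .setMaxWriteBufferNumber(2)           // memtables per column family
                    .setTableFormatConfig(new BlockBasedTableConfig()
                        .setBlockCacheSize(256 * 1024 * 1024)); // block cache (assumed 256MB)
            }
        });
        return backend;
    }
}
```

Note that each stateful operator instance gets its own RocksDB instance, so these per-instance sizes multiply with the number of slots and stateful operators on a TaskManager.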

Last but not least, you should not activate taskmanager.memory.preallocate when running streaming jobs, because streaming jobs currently don't use managed memory. Thus, by activating preallocation, you would allocate memory for Flink's managed memory, which reduces the available heap space.
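
On a standalone cluster this flag is a flink-conf.yaml entry (taskmanager.memory.preallocate: false, which is also the default). Purely for illustration, the same key can be set on a local environment for experimentation:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PreallocateExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Leave managed-memory preallocation off for streaming jobs (the default).
        conf.setBoolean("taskmanager.memory.preallocate", false);
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.createLocalEnvironment(4, conf);
        // ... define the streaming job on env, then call env.execute(...) ...
    }
}
```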
