Apache NiFi - OutOfMemory Error: GC overhead limit exceeded on SplitText processor

Problem description

I am trying to use NiFi to process large CSV files (potentially billions of records each) using HDF 1.2. I've implemented my flow, and everything is working fine for small files.

The problem is that if I try to push the file size to 100MB (1M records), I get a java.lang.OutOfMemoryError: GC overhead limit exceeded from the SplitText processor responsible for splitting the file into single records. I've searched for that error, and it basically means that the garbage collector runs for too long without reclaiming much heap space. I expect this means that too many flow files are being generated too fast.

How can I solve this? I've tried changing NiFi's configuration for the max heap size and other memory-related properties, but nothing seems to work.
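
For reference, NiFi's heap limits are set through the JVM arguments in conf/bootstrap.conf; a minimal sketch of such a change is below (the java.arg numbers can differ between NiFi versions, and the 8g value is only an illustration, not a tuned recommendation):

    # conf/bootstrap.conf - JVM memory settings (restart NiFi after editing)
    java.arg.2=-Xms8g    # initial heap size
    java.arg.3=-Xmx8g    # maximum heap size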

Right now I have added an intermediate SplitText with a line count of 1K, and that allows me to avoid the error, but I don't see it as a solid solution: when the incoming file grows potentially much larger than that, I am afraid I will get the same behavior from the processor.

Any suggestion is welcome! Thank you.

Solution

The reason for the error is that when splitting 1M records with a line count of 1, you are creating 1M flow files, which equates to 1M Java objects. Overall, the approach of using two SplitText processors is common and avoids creating all of the objects at the same time. You could probably use an even larger split size on the first split, maybe 10k. For a billion records, I wonder if a third level would make sense: split from 1B down to maybe 10M, then 10M to 10K, then 10K to 1, but I would have to play with it. Such a cascade is sketched below.
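
As a rough sketch (the split sizes are the ones guessed at above and would still need tuning), the three-level cascade would be three SplitText processors chained on their splits relationship, each configured through the processor's Line Split Count property:

    SplitText #1   Line Split Count = 10000000   # 1B lines  -> ~100 flow files of 10M lines each
    SplitText #2   Line Split Count = 10000      # 10M lines -> 1000 flow files of 10K lines each
    SplitText #3   Line Split Count = 1          # 10K lines -> 10000 single-record flow files each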

Some additional things to consider are increasing the default heap size from 512MB, which you may have already done, and figuring out whether you really need to split down to 1 line. It is hard to say without knowing anything else about the flow, but in a lot of cases, if you want to deliver each line somewhere, you could potentially have a processor that reads in a large delimited file and streams each line to the destination. For example, this is how PutKafka and PutSplunk work: they can take a file with 1M lines and stream each line to the destination.
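
As an illustration of that last point, PutKafka exposes a Message Delimiter property that treats each delimited chunk of the flow file content as a separate message, so no per-line splitting is needed upstream. A hedged sketch of the relevant settings (the broker address and topic name are placeholders):

    PutKafka
      Known Brokers      = broker1:9092   # placeholder broker address
      Topic Name         = my-topic       # placeholder topic
      Message Delimiter  = \n             # emit one Kafka message per line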
