Cloud Dataflow - Increase JVM Xmx Value
Problem Description
We are trying to run a Google Cloud Dataflow job in the cloud but we keep getting "java.lang.OutOfMemoryError: Java heap space".
We are trying to process 610 million records from a BigQuery table and write the processed records to 12 different outputs (main + 11 side outputs).
We have tried increasing the number of workers to 64 n1-standard-4 instances, but we are still hitting the issue.
The Xmx value on the VMs seems to be set at ~4 GB (-Xmx3951927296), even though the instances have 15 GB of memory. Is there any way of increasing the Xmx value?
The job ID is 2015-06-11_21_32_32-16904087942426468793.
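To confirm what the workers are actually granted, you can log the JVM's effective max heap (what -Xmx allows) from inside your pipeline code, e.g. in a DoFn's setup. The snippet below is a plain-Java sketch of that check using the standard `Runtime` API; the class name is hypothetical.

```java
// Minimal sketch: report the JVM's effective max heap, which on a
// Dataflow worker reflects the -Xmx the service chose for the VM.
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

Logging this once per worker (rather than per record) is enough to verify whether a larger machine type actually raised the heap.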
You can't directly set the heap size. Dataflow, however, scales the heap size with the machine type. You can pick a machine with more memory by setting the flag "--machineType". The heap size should increase linearly with the total memory of the machine type.
Dataflow deliberately limits the heap size to avoid negatively impacting the shuffler.
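As a launch-configuration sketch of passing that flag, assuming a bundled Dataflow SDK for Java jar: the jar name, main class, project, and bucket below are placeholders, and the flag name is taken from the answer above.

```shell
# Hypothetical launch command: everything except the flag names is a placeholder.
# n1-highmem-8 offers 52 GB RAM vs. 15 GB on n1-standard-4, so the
# service should grant a proportionally larger heap.
java -cp target/my-pipeline-bundled.jar com.example.MyPipeline \
  --runner=BlockingDataflowPipelineRunner \
  --project=my-gcp-project \
  --stagingLocation=gs://my-bucket/staging \
  --machineType=n1-highmem-8
```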
Is your code explicitly accumulating values from multiple records in memory? Do you expect 4GB to be insufficient for any given record?
Dataflow's memory requirements should scale with the size of individual records and the amount of data your code is buffering in memory. Dataflow's memory requirements shouldn't increase with the number of records.
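The distinction drawn above can be illustrated outside the Dataflow API with plain Java; the class and method names here are hypothetical. The first method's heap footprint grows with the number of records, the second's is bounded by the size of one record.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only (not Dataflow code): memory that scales with record
// count vs. memory bounded by a single record.
public class BufferingDemo {

    // Anti-pattern: buffers every record before processing, so heap use
    // grows linearly with the input - this is what causes OOMs at scale.
    static int sumBuffered(Iterable<int[]> records) {
        List<int[]> all = new ArrayList<>();
        for (int[] r : records) {
            all.add(r); // accumulates across records
        }
        int sum = 0;
        for (int[] r : all) {
            for (int v : r) sum += v;
        }
        return sum;
    }

    // Streaming: only the current record is resident, so heap use is
    // bounded by the largest record regardless of input size.
    static int sumStreaming(Iterable<int[]> records) {
        int sum = 0;
        for (int[] r : records) {
            for (int v : r) sum += v;
        }
        return sum;
    }
}
```

If a DoFn follows the streaming shape, 610 million records should not by itself require a larger heap; buffering state across records (or one very large side output window) would.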