Returning a large data structure from Dataflow worker node, getting stuck in serializing graph
Problem description
I have a large graph (~100k vertices and ~1 million edges) that is constructed inside a DoFn function. When I try to output that graph from the DoFn, execution gets stuck at c.output(graph);.
public static class Prep extends DoFn<TableRow, TableRows> {
    @Override
    public void processElement(ProcessContext c) {
        // Graph creation logic runs very fast, no problem here
        LOG.info("Starting Graph Output"); // can see this in the logs
        c.output(graph); // outputs data from the DoFn function
        LOG.info("Ending Graph Output"); // never see this in the logs
    }
}
My graph class is just a Map of vertices, serialized with AvroCoder.
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.reflect.Nullable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.X.Prep;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

// Class that creates the Graph data structure for custom seg definitions
@DefaultCoder(AvroCoder.class)
public class MyGraph {
    // Vertices keyed by name; the whole map is encoded as one element
    @Nullable
    public Map<String, GraphVertex> vertexList = new HashMap<String, GraphVertex>();
}
I have tried JSON serialization with json-simple, gson, and jackson; all of them take too long to serialize this graph.
Recommended answer
The graph object is likely too large to be encoded and passed around as a single element. You should explore other mechanisms for getting the graph to the workers. For example, you could create a multimap-valued side input keyed by vertex. That way the graph is represented as a PCollection of small elements, which is processed in parallel, rather than as one huge value.
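To make the data shape behind this suggestion concrete, here is a minimal sketch in plain Java, without the Dataflow SDK. It assumes the graph can be expressed as an adjacency map of vertex names; the class `GraphAsEdges` and its methods are hypothetical names, not part of the question or of any library. The idea is to emit many small (vertex, neighbor) pairs instead of one large object. In a pipeline, those pairs would be the elements of a keyed PCollection, and a multimap view over it (for example via `View.asMultimap()`) would present them grouped by vertex, much like `toMultimap` does here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: illustrates the shape of the data only, not the Dataflow API.
final class GraphAsEdges {

    // Flatten an adjacency map into small (source, target) pairs.
    // Each pair is cheap to encode on its own, unlike the whole graph.
    static List<Map.Entry<String, String>> toEdges(Map<String, List<String>> adjacency) {
        List<Map.Entry<String, String>> edges = new ArrayList<>();
        for (Map.Entry<String, List<String>> vertex : adjacency.entrySet()) {
            for (String target : vertex.getValue()) {
                edges.add(new SimpleEntry<>(vertex.getKey(), target));
            }
        }
        return edges;
    }

    // Regroup the pairs by source vertex, which is the view a
    // multimap-valued side input keyed by vertex would offer to a worker.
    static Map<String, List<String>> toMultimap(List<Map.Entry<String, String>> edges) {
        Map<String, List<String>> multimap = new HashMap<>();
        for (Map.Entry<String, String> edge : edges) {
            multimap.computeIfAbsent(edge.getKey(), k -> new ArrayList<>())
                    .add(edge.getValue());
        }
        return multimap;
    }
}
```

With this shape, the DoFn would call `c.output(...)` once per edge rather than once for the whole graph, so no single element has to carry a million edges.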
Alternatively, since the graph creation logic runs very fast, just run that logic on each worker rather than trying to serialize the entire graph.
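One way this could look is a lazily initialized, worker-local holder. The sketch below is plain Java and makes several assumptions: `WorkerLocalGraph`, `get`, and `buildGraph` are hypothetical names, the tiny hard-coded graph stands in for the real creation logic from the question, and a static field is assumed to be shared by all DoFn instances running in the same worker JVM.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder: builds the graph at most once per JVM (that is, per worker process).
final class WorkerLocalGraph {
    private static Map<String, List<String>> graph;
    private static int buildCount = 0;

    // Returns the worker-local graph, constructing it on first use.
    // Synchronized so concurrent threads on one worker build it only once.
    static synchronized Map<String, List<String>> get() {
        if (graph == null) {
            graph = buildGraph();
            buildCount++;
        }
        return graph;
    }

    // Exposed only to show that construction happens a single time.
    static synchronized int buildCount() {
        return buildCount;
    }

    // Placeholder for the fast graph creation logic described in the question.
    private static Map<String, List<String>> buildGraph() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("a", Arrays.asList("b", "c"));
        g.put("b", Collections.singletonList("c"));
        return Collections.unmodifiableMap(g);
    }
}
```

A DoFn that needs the graph could then call `WorkerLocalGraph.get()` inside `processElement` (or once in `startBundle`) and never pass the graph through `c.output`, so the graph is never encoded at all.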