Returning a large data structure from Dataflow worker node, getting stuck in serializing graph

Problem description

I have a large graph (~100k vertices and ~1 million edges) that is constructed inside a DoFn. When I try to output that graph from the DoFn, execution gets stuck at c.output(graph);.

    public static class Prep extends DoFn<TableRow, MyGraph> {

        @Override
        public void processElement(ProcessContext c) {
            // Graph creation logic runs very fast, no problem here

            LOG.info("Starting Graph Output");  // this message appears in the logs
            c.output(graph);                    // emits the graph as a single output element
            LOG.info("Ending Graph Output");    // this message never appears in the logs
        }
    }

My graph class is just a Map of vertices, serialized with AvroCoder.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.avro.reflect.Nullable;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import com.X.Prep;
    import com.google.cloud.dataflow.sdk.coders.AvroCoder;
    import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

    // Class that creates the Graph data structure for custom seg definitions
    @DefaultCoder(AvroCoder.class)
    public class MyGraph {
        // Vertices keyed by a String identifier
        @Nullable
        public Map<String, GraphVertex> vertexList = new HashMap<String, GraphVertex>();
    }

I have tried JSON serialization with json-simple, gson, and jackson; all of them take too long to serialize this graph.

Recommended answer

The graph object is likely too large to be encoded and passed around as a single element. You should explore other mechanisms for getting the graph to the workers. For example, create a multimap-valued side input keyed by vertex. The graph is then represented as a PCollection of small records, which is processed in parallel, instead of as one huge value.
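As a rough illustration of the shape of data this implies, the sketch below uses only the Java standard library to flatten an adjacency map into one small record per edge and then regroup those records by source vertex. The class name `GraphAsEdges`, its methods, and the `Map<String, List<String>>` adjacency representation are assumptions made for the example and are not from the question. In a Dataflow pipeline, such records could be the elements of a `PCollection<KV<String, String>>`, and `View.asMultimap()` could turn that collection into a side input; that wiring is not shown here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
public class GraphAsEdges {

    /** Flattens an adjacency map into one small (source, target) record per edge. */
    public static List<Map.Entry<String, String>> toEdgeRecords(
            Map<String, List<String>> adjacency) {
        List<Map.Entry<String, String>> edges = new ArrayList<>();
        for (Map.Entry<String, List<String>> vertex : adjacency.entrySet()) {
            for (String target : vertex.getValue()) {
                edges.add(new SimpleEntry<>(vertex.getKey(), target));
            }
        }
        return edges;
    }

    /** Regroups edge records by source vertex, similar to what a multimap view exposes. */
    public static Map<String, List<String>> toMultimap(
            List<Map.Entry<String, String>> edges) {
        Map<String, List<String>> multimap = new HashMap<>();
        for (Map.Entry<String, String> edge : edges) {
            multimap.computeIfAbsent(edge.getKey(), k -> new ArrayList<>())
                    .add(edge.getValue());
        }
        return multimap;
    }
}
```

Each record is tiny, so no single element has to carry the whole graph through a coder.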

Alternatively, since the graph creation logic runs very fast, just run that logic on each worker rather than trying to serialize the entire graph.
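One way to avoid rebuilding the graph for every element is to build it lazily and reuse it. The following is a minimal sketch using only the Java standard library; `PerWorkerLazy` is a hypothetical name, and the builder passed to it stands in for the asker's graph creation logic, which is not shown in the question. It assumes the holder is kept in a static field so that it is shared within a worker JVM and is not serialized along with the DoFn.

```java
import java.util.function.Supplier;

// Hypothetical helper, for illustration only.
public class PerWorkerLazy<T> {
    private final Supplier<T> builder;
    private T value; // guarded by "this"

    public PerWorkerLazy(Supplier<T> builder) {
        this.builder = builder;
    }

    /**
     * Returns the cached value, invoking the builder on the first call.
     * The builder runs again only if it returned null.
     */
    public synchronized T get() {
        if (value == null) {
            value = builder.get();
        }
        return value;
    }
}
```

Inside processElement, the DoFn would call get() on such a holder and use the local graph directly, so the graph never needs to pass through a coder.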
