How to fix Dataflow unable to serialize my DoFn?


Problem description

When I run my Dataflow pipeline, I get the exception below complaining that my DoFn can't be serialized. How do I fix this?

Here's the stack trace:

Caused by: java.lang.IllegalArgumentException: unable to serialize contrail.dataflow.AvroMRTransforms$AvroReducerDoFn@bba0fc2
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:51)
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.ensureSerializable(SerializableUtils.java:81)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.ensureSerializable(DirectPipelineRunner.java:784)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateHelper(ParDo.java:1025)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateSingleHelper(ParDo.java:963)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.access$000(ParDo.java:441)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:951)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:946)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:611)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
    at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
    at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:584)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:328)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
    at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:145)
    at contrail.stages.DataflowStage.stageMain(DataflowStage.java:51)
    at contrail.stages.NonMRStage.execute(NonMRStage.java:130)
    at contrail.stages.NonMRStage.run(NonMRStage.java:157)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at contrail.stages.ValidateGraphDataflow.main(ValidateGraphDataflow.java:139)
    ... 6 more
Caused by: java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:47)
    ... 27 more

Answer

If you scroll through the stack trace, one of the causes clearly identifies the data that isn't serializable.

Caused by: java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf

The problem was my DoFn was taking a JobConf instance in the constructor and storing it in an instance variable. I was assuming JobConf was serializable but it turns out it isn't.
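A minimal sketch of that problematic shape, with placeholder element types (the real signature from contrail isn't shown in the question):

import org.apache.hadoop.mapred.JobConf;

import com.google.cloud.dataflow.sdk.transforms.DoFn;

// Sketch of the failing pattern: JobConf is not java.io.Serializable,
// so serializing the DoFn fails on this field with NotSerializableException.
public class AvroReducerDoFn extends DoFn<String, String> {
  private final JobConf jobConf;  // non-serializable field

  public AvroReducerDoFn(JobConf conf) {
    this.jobConf = conf;  // stored directly in an instance variable -- the mistake
  }

  @Override
  public void processElement(ProcessContext c) {
    // ... uses jobConf ...
  }
}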

To solve this I did the following:

  • I marked the JobConf member variable transient so it wouldn't be serialized.
  • I created a separate variable of type byte[] to store a serialized version of the JobConf.
  • In my constructor, I serialized the JobConf into the byte[] and stored it in that instance variable.
  • I overrode startBundle and deserialized the JobConf from the byte[], as sketched below.

Here's the gist of my DoFn.
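A minimal sketch of that fix, again with placeholder element types; the byte[] round trip relies on JobConf implementing Hadoop's Writable interface (inherited from Configuration):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;

import com.google.cloud.dataflow.sdk.transforms.DoFn;

public class AvroReducerDoFn extends DoFn<String, String> {
  // transient: Java serialization skips this field, so the
  // non-serializable JobConf no longer breaks serializing the DoFn.
  private transient JobConf jobConf;

  // Serializable stand-in: the JobConf rendered to bytes.
  private final byte[] serializedJobConf;

  public AvroReducerDoFn(JobConf conf) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    conf.write(out);  // Writable.write serializes the configuration
    out.close();
    this.serializedJobConf = bytes.toByteArray();
  }

  @Override
  public void startBundle(Context c) throws IOException {
    // Rebuild the JobConf on the worker before any elements are processed.
    jobConf = new JobConf();
    jobConf.readFields(
        new DataInputStream(new ByteArrayInputStream(serializedJobConf)));
  }

  @Override
  public void processElement(ProcessContext c) {
    // ... use jobConf here ...
  }
}

Because startBundle runs on the worker after the DoFn is deserialized, rebuilding the JobConf there guarantees it exists before processElement is ever called.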
