Avro to BigTable - Schema issue?


Question

I'm trying to ingest an Avro file (produced with Spark 3.0) into BigTable using the Dataflow template [1], and I get the error below.

N.B. This file can be read with both Spark and the Python avro library without any apparent issue.
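For example, the writer schema embedded in the file header can be checked with the fastavro library; a minimal sketch (the path is a placeholder):

# Sketch: dump the writer schema embedded in the Avro file header.
import fastavro

with open("myfile.avro", "rb") as f:      # placeholder path
    reader = fastavro.reader(f)
    print(reader.writer_schema)           # shows the topLevelRecord schema
    print(next(reader, None))             # records also deserialize fine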

Any ideas?

Thanks for your support!

Error (short)

Caused by: org.apache.avro.AvroTypeException: Found topLevelRecord, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key

Avro schema (extract)

{"type":"record","name":"topLevelRecord","fields":[{"name":"a_a","type":[字符串"、空"]}、...]}

Error (full)

java.io.IOException: Failed to start reading from source: gs://myfolder/myfile.avro range [0, 15197631)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start (WorkerCustomSources.java:610)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start (ReadOperation.java:361)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop (ReadOperation.java:194)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start (ReadOperation.java:159)
at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute (MapTaskExecutor.java:77)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork (BatchDataflowWorker.java:417)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork (BatchDataflowWorker.java:386)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork (BatchDataflowWorker.java:311)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork (DataflowBatchWorkerHarness.java:140)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call (DataflowBatchWorkerHarness.java:120)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call (DataflowBatchWorkerHarness.java:107)
at java.util.concurrent.FutureTask.run (FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
at java.lang.Thread.run (Thread.java:834)
Caused by: org.apache.avro.AvroTypeException: Found topLevelRecord, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
at org.apache.avro.io.ResolvingDecoder.doAction (ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance (Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder (ResolvingDecoder.java:130)
at org.apache.avro.generic.GenericDatumReader.readRecord (GenericDatumReader.java:215)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion (GenericDatumReader.java:175)
at org.apache.avro.generic.GenericDatumReader.read (GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.read (GenericDatumReader.java:145)
at org.apache.beam.sdk.io.AvroSource$AvroBlock.readNextRecord (AvroSource.java:644)
at org.apache.beam.sdk.io.BlockBasedSource$BlockBasedReader.readNextRecord (BlockBasedSource.java:210)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl (FileBasedSource.java:484)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl (FileBasedSource.java:479)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start (OffsetBasedSource.java:249)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start (WorkerCustomSources.java:607)

References:

[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#avrofiletocloudbigtable

Answer

BigTable is a scalable NoSQL database service, which means it is schema-free; whereas Spark SQL has a schema, as you indicated in your question.

The error below refers to the BigTable row key:

expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
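For reference, the template deserializes every record against the BigtableRow schema that ships with the DataflowTemplates repository; it looks roughly like the following (check the template source for the authoritative version):

{
  "type": "record",
  "name": "BigtableRow",
  "namespace": "com.google.cloud.teleport.bigtable",
  "fields": [
    {"name": "key", "type": "bytes"},
    {"name": "cells", "type": {"type": "array", "items": {
      "type": "record",
      "name": "BigtableCell",
      "fields": [
        {"name": "family", "type": "string"},
        {"name": "qualifier", "type": "bytes"},
        {"name": "timestamp", "type": "long"},
        {"name": "value", "type": "bytes"}
      ]
    }}}
  ]
}

A flat topLevelRecord with fields like a_a cannot be resolved against this, hence the "missing required field key" error.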

Therefore, you need to create your BigTable schema design following this process.
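As a concrete illustration of that design step, here is a minimal Python sketch (using the fastavro library; the row-key field "a_a", the column family "cf", and the file names are hypothetical placeholders to be replaced by your own schema design) that reshapes flat Spark-produced records into the BigtableRow layout the template expects:

# Sketch: reshape flat Spark-produced records into BigtableRow-shaped Avro.
# The key field "a_a", column family "cf" and file names are hypothetical.
import time
import fastavro

BIGTABLE_ROW_SCHEMA = {
    "type": "record",
    "name": "BigtableRow",
    "namespace": "com.google.cloud.teleport.bigtable",
    "fields": [
        {"name": "key", "type": "bytes"},
        {"name": "cells", "type": {"type": "array", "items": {
            "type": "record",
            "name": "BigtableCell",
            "fields": [
                {"name": "family", "type": "string"},
                {"name": "qualifier", "type": "bytes"},
                {"name": "timestamp", "type": "long"},
                {"name": "value", "type": "bytes"},
            ],
        }}},
    ],
}

def to_bigtable_row(flat, key_field="a_a", family="cf"):
    """Map one flat record onto a row key plus one cell per column."""
    now_micros = int(time.time() * 1_000_000)
    return {
        "key": str(flat[key_field]).encode("utf-8"),
        "cells": [
            {
                "family": family,
                "qualifier": name.encode("utf-8"),
                "timestamp": now_micros,
                "value": str(value).encode("utf-8"),
            }
            for name, value in flat.items()
            if value is not None            # skip null columns entirely
        ],
    }

# Read the Spark output and write a template-compatible copy.
with open("myfile.avro", "rb") as src, open("bigtable_rows.avro", "wb") as dst:
    records = (to_bigtable_row(r) for r in fastavro.reader(src))
    fastavro.writer(dst, BIGTABLE_ROW_SCHEMA, records)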

Since HBase is also schema-free, your use case could also be solved with Bigtable and the HBase API, if you're flexible about using Spark 2.4.0.
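Alternatively, if staying on Spark 3.0 matters more than using the template, rows can be written directly with the google-cloud-bigtable Python client rather than the HBase API, bypassing the Avro template entirely; a minimal sketch (project, instance, table, family, and key names are placeholders):

# Sketch: write rows straight to Bigtable with the Python client.
# All resource names below are placeholders.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

now = datetime.datetime.utcnow()
row = table.direct_row(b"row1")                      # row key
row.set_cell("cf", b"a_a", b"some value", timestamp=now)
statuses = table.mutate_rows([row])                  # batch-apply mutations
for status in statuses:
    if status.code != 0:                             # non-zero = failed mutation
        print("mutation failed:", status)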

As for the above use case, it looks to be a valid feature request, which I will file with the product team, and I'll update you with the report number.
