Avro to BigTable - Schema issue?
Question
I'm trying to ingest an Avro file (produced with Spark 3.0) into BigTable using the Dataflow template [1], and get the error below.
N.B. This file can be read with Spark and the Python avro library without apparent issue.
Any ideas? Thanks for your support!
Error (short)
Caused by: org.apache.avro.AvroTypeException: Found topLevelRecord, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
Avro schema (extract)
{"type":"record","name":"topLevelRecord","fields":[{"name":"a_a","type":["string","null"]},...]}
Error (full)
java.io.IOException: Failed to start reading from source: gs://myfolder/myfile.avro range [0, 15197631)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start (WorkerCustomSources.java:610)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start (ReadOperation.java:361)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop (ReadOperation.java:194)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start (ReadOperation.java:159)
at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute (MapTaskExecutor.java:77)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork (BatchDataflowWorker.java:417)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork (BatchDataflowWorker.java:386)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork (BatchDataflowWorker.java:311)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork (DataflowBatchWorkerHarness.java:140)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call (DataflowBatchWorkerHarness.java:120)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call (DataflowBatchWorkerHarness.java:107)
at java.util.concurrent.FutureTask.run (FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
at java.lang.Thread.run (Thread.java:834)
Caused by: org.apache.avro.AvroTypeException: Found topLevelRecord, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
at org.apache.avro.io.ResolvingDecoder.doAction (ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance (Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder (ResolvingDecoder.java:130)
at org.apache.avro.generic.GenericDatumReader.readRecord (GenericDatumReader.java:215)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion (GenericDatumReader.java:175)
at org.apache.avro.generic.GenericDatumReader.read (GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.read (GenericDatumReader.java:145)
at org.apache.beam.sdk.io.AvroSource$AvroBlock.readNextRecord (AvroSource.java:644)
at org.apache.beam.sdk.io.BlockBasedSource$BlockBasedReader.readNextRecord (BlockBasedSource.java:210)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl (FileBasedSource.java:484)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl (FileBasedSource.java:479)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start (OffsetBasedSource.java:249)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start (WorkerCustomSources.java:607)
References:
[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#avrofiletocloudbigtable
Answer
BigTable is a scalable NoSQL database service, which means it is schemaless; whereas Spark SQL has a schema, as you indicated in your question.
The error below is referring you to the BigTable row key:
expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
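The template deserializes each input record against its own Avro schema, not yours. As a point of reference (reconstructed from memory of the Teleport templates repository, so treat the details as an assumption rather than authoritative), the expected com.google.cloud.teleport.bigtable.BigtableRow schema looks roughly like this:

```json
{
  "type": "record",
  "name": "BigtableRow",
  "namespace": "com.google.cloud.teleport.bigtable",
  "fields": [
    {"name": "key", "type": "bytes"},
    {"name": "cells", "type": {"type": "array", "items": {
      "type": "record",
      "name": "BigtableCell",
      "fields": [
        {"name": "family", "type": "string"},
        {"name": "qualifier", "type": "bytes"},
        {"name": "timestamp", "type": "long"},
        {"name": "value", "type": "bytes"}
      ]
    }}}
  ]
}
```

A Spark-produced topLevelRecord with flat columns does not match this shape, which is why schema resolution fails with "missing required field key".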
Therefore, you need to follow this process to create your BigTable schema design.
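One way to bridge the gap is to reshape each flat record into the BigtableRow layout before writing the Avro file the template reads. A minimal sketch in plain Python follows; the column family name "cf" and the choice of row-key field are assumptions for illustration, and the actual Avro serialization (e.g. with a library such as fastavro against the BigtableRow schema) is left out:

```python
import time

COLUMN_FAMILY = "cf"  # hypothetical column family; use your own design

def to_bigtable_row(record: dict, key_field: str) -> dict:
    """Reshape one flat record into a BigtableRow-shaped dict:
    a bytes row key plus a list of cells (family, qualifier,
    timestamp in microseconds, value)."""
    ts = int(time.time() * 1_000_000)  # timestamp-micros
    cells = [
        {
            "family": COLUMN_FAMILY,
            "qualifier": name.encode("utf-8"),
            "timestamp": ts,
            "value": str(value).encode("utf-8"),
        }
        for name, value in record.items()
        if name != key_field and value is not None  # skip nulls and the key
    ]
    return {"key": str(record[key_field]).encode("utf-8"), "cells": cells}

# Example: a Spark-style flat record with a nullable column
row = to_bigtable_row({"id": "r1", "a_a": "hello"}, key_field="id")
```

Records in this shape can then be written with an Avro writer using the BigtableRow schema, and the template should be able to resolve them.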
Since HBase is also schemaless, your use case could be addressed by using Bigtable with the HBase API, if you're flexible enough to use Spark 2.4.0.
As for the above use case, it looks to be a valid feature request, which I will file with the product team and update you with the report number.