How to batch load custom Avro data generated from another source?

Question

The Cloud Spanner docs say that Spanner can export/import Avro format. Can this path also be used for batch ingestion of Avro data generated from another source? The docs seem to suggest it can only import Avro data that was also generated by Spanner.

I ran a quick export job and took a look at the generated files. The manifest and schema look pretty straightforward. I figured I would post here in case this rabbit hole is deep.

Manifest file

{
  "files": [{
    "name": "people.avro-00000-of-00001",
    "md5": "HsMZeZFnKd06MVkmiG42Ag=="
  }]
}
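
Incidentally, the "md5" value here appears to be the base64-encoded MD5 digest of the Avro data file, so a per-table manifest for hand-built files could be generated along these lines (a minimal Python sketch; the digest encoding is inferred from the export output above, and the file name is a placeholder):

import base64
import hashlib
import json

def make_table_manifest(avro_file):
    # Digest the raw bytes of the data file and base64-encode the result,
    # matching the "md5" field in the exported manifest above (an assumption).
    with open(avro_file, "rb") as f:
        digest = hashlib.md5(f.read()).digest()
    return json.dumps({
        "files": [{
            "name": avro_file,
            "md5": base64.b64encode(digest).decode("ascii"),
        }]
    }, indent=2)

print(make_table_manifest("people.avro-00000-of-00001"))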

Schema file

{
  "tables": [{
    "name": "people",
    "manifestFile": "people-manifest.json"
  }]
}

Data file (embedded Avro schema)

    {"type":"record",
    "name":"people",
    "namespace":
    "spannerexport","
    fields":[
{"name":"fullName",
"type":["null","string"],
"sqlType":"STRING(MAX)"},{"name":"memberId",
"type":"long",
"sqlType":"INT64"}
],
    "googleStorage":"CloudSpanner",
    "spannerPrimaryKey":"`memberId` ASC",
    "spannerParent":"",
    "spannerPrimaryKey_0":"`memberId` ASC",
    "googleFormatVersion":"1.0.0"}    

Answer

In response to your question, yes! There are two ways to ingest Avro data into Cloud Spanner.

Approach 1

If you place Avro files in a Google Cloud Storage bucket arranged the way a Cloud Spanner export operation would arrange them, and you generate a manifest formatted as Cloud Spanner expects, then the import functionality in the Cloud Spanner web interface will work. Obviously, there may be a lot of tedious formatting work here, which is why the official documentation states that this "import process supports only Avro files exported from Cloud Spanner".
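
For concreteness, the three kinds of files shown in the question would be staged together under one Cloud Storage prefix. A minimal sketch with the google-cloud-storage client (bucket and prefix names are placeholders; the top-level table list is the file an export job writes as spanner-export.json):

from google.cloud import storage

bucket = storage.Client().bucket("my-import-bucket")

# The flat layout an export job produces: the table list, one manifest
# per table, and the Avro data files that manifest references.
for filename in [
    "spanner-export.json",
    "people-manifest.json",
    "people.avro-00000-of-00001",
]:
    bucket.blob("my-export-dir/" + filename).upload_from_filename(filename)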

Approach 2

Instead of executing the import/export job through the Cloud Spanner web console and relying on the Avro manifest and data files being perfectly formatted, slightly modify the code in one of two public GitHub repositories under the Google Cloud Platform organization that provide import/export (or backup/restore) functionality for moving Avro data into Google Cloud Spanner: (1) Dataflow Templates and (2) Pontem. Each repository has a specific file implementing the Avro-to-Spanner import.

Both of these repositories contain ready-made Dataflow jobs for moving data into and out of Cloud Spanner in Avro format, and each has its own way of parsing an Avro schema on input (i.e., when moving data from Avro into Cloud Spanner). Since your use case is input (i.e., ingesting Avro-formatted data into Cloud Spanner), you need to modify the Avro parsing code to fit your specific schema and then launch the Cloud Dataflow job from the command line on your local machine (the job is then uploaded to Google Cloud Platform).
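
If modifying those Java pipelines is more than a one-off load warrants, the same idea can be sketched directly against the regular Spanner client library, skipping Dataflow entirely (not what this answer prescribes; the instance, database, and chunk size are placeholders, and fastavro is assumed for reading):

from fastavro import reader
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Read the records straight out of the Avro data file.
with open("people.avro-00000-of-00001", "rb") as f:
    rows = [(r["fullName"], r["memberId"]) for r in reader(f)]

# Spanner commits are subject to mutation limits, so insert in chunks.
for i in range(0, len(rows), 500):
    with database.batch() as batch:
        batch.insert(
            table="people",
            columns=("fullName", "memberId"),
            values=rows[i:i + 500],
        )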

If you are not familiar with Cloud Dataflow, it is a tool for defining and running jobs with large data sets.
