Migrate hive table to Google BigQuery


Question


I am trying to design a data pipeline to migrate my Hive tables into BigQuery. Hive is running on a Hadoop on-premises cluster. This is my current design; actually, it is very simple, just a shell script:


for each table source_hive_table {

  • INSERT OVERWRITE TABLE target_avro_hive_table SELECT * FROM source_hive_table;
  • Move the resulting avro files into Google Cloud Storage using distcp
  • Create the first BQ table: bq load --source_format=AVRO your_dataset.something something.avro
  • Handle any casting problem from BigQuery itself, so SELECT from the table just written and handle any casting manually

}
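
For concreteness, here is a minimal sketch of that loop as an actual shell script. Every name in it (the table list, the HDFS staging path, the GCS bucket, the BigQuery dataset) is a hypothetical placeholder, and it assumes the Avro-backed Hive tables already exist and that the cluster has the GCS connector configured for distcp:

```bash
#!/usr/bin/env bash
# Minimal sketch of the pipeline above; all names are hypothetical.
set -euo pipefail

TABLES="orders customers"                       # hypothetical source tables
HDFS_STAGE="/user/hive/warehouse/avro_stage"    # hypothetical Avro staging path
GCS_BUCKET="gs://my-migration-bucket"           # hypothetical bucket
BQ_DATASET="your_dataset"

for t in ${TABLES}; do
  # 1. Re-materialize the source table as Avro; ${t}_avro is assumed to be
  #    a STORED AS AVRO table whose LOCATION is ${HDFS_STAGE}/${t}.
  hive -e "INSERT OVERWRITE TABLE ${t}_avro SELECT * FROM ${t};"

  # 2. Copy the resulting Avro files to GCS with distcp (assumes the
  #    GCS connector is installed on the Hadoop cluster).
  hadoop distcp "${HDFS_STAGE}/${t}/" "${GCS_BUCKET}/${t}/"

  # 3. Load the Avro files into BigQuery; the schema is read from the files.
  bq load --source_format=AVRO "${BQ_DATASET}.${t}" "${GCS_BUCKET}/${t}/*.avro"
done
```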


Do you think it makes sense? Is there any better way, perhaps using Spark? I am not happy with the way I am handling the casting; I would like to avoid creating the BigQuery table twice.

Answer


Yes, your migration logic makes sense.


I personally prefer to do the CAST for specific types directly in the initial "Hive query" that generates your Avro (Hive) data. For instance, the "decimal" type in Hive maps to this Avro logical type: {"type":"bytes","logicalType":"decimal","precision":10,"scale":2}


And BQ will just take the primitive type (here "bytes") instead of the logicalType. That is why I find it easier to cast directly in Hive (here to "double"). The same problem happens with the Hive date type.
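
For illustration, a minimal sketch of what pushing the casts into the extraction query might look like. The table and column names are hypothetical; the decimal-to-DOUBLE cast follows the suggestion above, while casting the date column to STRING is just one possible workaround for the date mapping, not something the answer prescribes:

```bash
# Hypothetical example: cast problem types in the Avro-generating query so
# the files carry plain primitive types that BigQuery loads directly.
hive -e "
INSERT OVERWRITE TABLE orders_avro
SELECT
  order_id,
  CAST(amount     AS DOUBLE) AS amount,      -- decimal(10,2) -> double
  CAST(order_date AS STRING) AS order_date   -- sidestep the date mapping
FROM orders;
"
```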
