Bigtable CSV import

Question

I have a large csv dataset (>5TB) in multiple files (stored in a storage bucket) that I need to import into Google Bigtable. The files are in the format:

rowkey,s1,s2,s3,s4
text,int,int,int,int
...

There is an importtsv function in HBase that would be perfect, but it does not seem to be available when using the Google HBase shell on Windows. Is it possible to use this tool? If not, what is the fastest way of achieving this? I have little experience with HBase and Google Cloud, so a simple example would be great. I have seen some similar examples using Dataflow, but would prefer not to learn that approach unless necessary.

Thanks

Answer

The ideal way to import something this large into Cloud Bigtable is to put your TSV on Google Cloud Storage.

  • gsutil mb <your-bucket-name>
  • gsutil -m cp -r <source dir> gs://<your-bucket-name>/
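
For illustration, with a hypothetical bucket named my-bigtable-import and a local directory csv-files (both names are placeholders, not from the original answer), the two commands would look like this; note that gsutil mb expects the bucket in gs:// form:

  gsutil mb gs://my-bigtable-import
  gsutil -m cp -r csv-files gs://my-bigtable-import/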

Then use Cloud Dataflow:

  1. Use the HBase shell to create the table, the column family, and the output columns (a minimal shell sketch follows this list).

  2. Write a small Dataflow job to read all the files, create a key for each row, and write the rows to the table. (See this example to get started.)
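
As a rough sketch of step 1 (untested; the table name csvdata and the column family name s are assumptions chosen to mirror the s1..s4 columns in the question), the HBase shell commands could be:

  hbase shell
  create 'csvdata', 's'

Individual columns such as s:s1 through s:s4 do not need to be declared up front; in HBase and Cloud Bigtable they are created implicitly when the Dataflow job in step 2 writes to them.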

A somewhat easier way would be the following (note: untested):

  • Copy your files to Google Cloud Storage
  • Use Google Cloud Dataproc; the example shows how to create a cluster and hook up Cloud Bigtable.
  • SSH to your cluster master; the script in the wordcount-mapreduce example accepts ./cluster ssh.
  • Use the HBase TSV importer to start a Map Reduce job.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> gs://<your-bucket-name>/<dir>/**
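
For the CSV layout in the question, a concrete invocation might look like the following sketch (untested; the table name csvdata, the column family s, and the bucket path are assumptions). The -Dimporttsv.separator flag is needed because ImportTsv expects tab-separated input by default, and HBASE_ROW_KEY marks the field to use as the row key:

  hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      '-Dimporttsv.separator=,' \
      -Dimporttsv.columns=HBASE_ROW_KEY,s:s1,s:s2,s:s3,s:s4 \
      csvdata gs://<your-bucket-name>/<dir>/**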
