Transfer large file from Google BigQuery to Google Cloud Storage

Problem description

I need to transfer a large table (2B records) from BigQuery to Cloud Storage in CSV format. I am doing the transfer using the console.

Due to the size of the file, I need to specify a URI that includes a * to shard the export. I end up with 400 CSV files in Cloud Storage, each with a header row.

This makes combining the files time-consuming, since I need to download the CSV files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the combined CSV file is about 48 GB.
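For reference, the header-stripping step above amounts to deleting the first line of each downloaded shard. A minimal sketch using a stand-in file (the real shards would come from the export; the local path is hypothetical):

```shell
# Create a stand-in shard with a header row (real shards come from the export)
mkdir -p export
printf 'col1,col2\nv1,v2\n' > export/part-000000000000.csv

# Delete the first (header) line of every shard in place
# (GNU sed; on macOS use: sed -i '' '1d' ...)
sed -i '1d' export/part-*.csv
```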

Is there a better approach for this?

Solution

Using the API, you can tell BigQuery not to print the header row during the table extraction. This is done by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility should also be able to do that.
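For illustration, the extract job configuration submitted through the REST API might look like the following (the project, dataset, table, and bucket names are placeholders); on the command line, the equivalent flag for bq extract should be --noprint_header:

```json
{
  "configuration": {
    "extract": {
      "sourceTable": {
        "projectId": "my-project",
        "datasetId": "my_dataset",
        "tableId": "my_table"
      },
      "destinationUris": ["gs://my-bucket/export/part-*.csv"],
      "destinationFormat": "CSV",
      "printHeader": false
    }
  }
}
```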

Once you've done this, concatenating the files is much easier. On a Linux/Mac machine it would be a single cat command. However, you could also try to concatenate directly in Cloud Storage by using the compose operation; see the Cloud Storage documentation for more details. Composition can be performed either from the API or from the command-line utility.
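For example, once the header-free shards have been downloaded, the cat step is a one-liner. A self-contained sketch with two stand-in shards (paths and contents are hypothetical):

```shell
# Two stand-in shards; the real ones would be downloaded from Cloud Storage
mkdir -p export
printf 'a,1\n' > export/part-000000000000.csv
printf 'b,2\n' > export/part-000000000001.csv

# The glob expands in lexical order, matching the export's shard numbering
cat export/part-*.csv > combined.csv
```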

Since the compose operation is limited to 32 components, you will have to compose the files in batches of 32. That works out to around 13 compose operations for the 400 files. Note that I have never tried the compose operation, so I'm just guessing on this part.
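The batching could be sketched as follows — a dry run that only prints the gsutil compose commands (the bucket and object names are hypothetical; drop the echo to actually run them):

```shell
# Build the list of 400 hypothetical shard names (12-digit suffixes,
# matching the naming produced by a wildcard export URI)
BUCKET="gs://my-bucket/export"
parts=()
for i in $(seq -f '%012g' 0 399); do
  parts+=("$BUCKET/part-$i.csv")
done

# Compose in batches of 32 (the per-operation component limit)
chunk=0
chunks=()
for ((i = 0; i < ${#parts[@]}; i += 32)); do
  echo gsutil compose "${parts[@]:i:32}" "$BUCKET/chunk-$chunk.csv"
  chunks+=("$BUCKET/chunk-$chunk.csv")
  chunk=$((chunk + 1))
done

# 400 shards in batches of 32 gives 13 intermediate chunks,
# which fit in a single final compose (13 <= 32)
echo gsutil compose "${chunks[@]}" "$BUCKET/combined.csv"
```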
