Exporting data to GCS from BigQuery - Split file size control

Problem description

I am currently exporting data from BigQuery to GCS buckets. I am doing this programmatically using the following extract job:

# bigquery_service is the BigQuery v2 API client, built elsewhere with
# googleapiclient.discovery.build('bigquery', 'v2', credentials=...)
query_request = bigquery_service.jobs()

DATASET_NAME = "#######"
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'

DESTINATION_PATH = 'gs://bucketname/foldername/'

query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            # Single wildcard URI: BigQuery shards the export across numbered files.
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': False,
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()

Since there is a constraint that only 1 GB per file can be exported to GCS, I used a single wildcard URI (https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple). This splits the output into multiple smaller parts, and each part is also gzipped.
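For reference, with a single wildcard pattern like the one above, BigQuery replaces the '*' with a sequential, zero-padded shard counter, so the bucket ends up with objects roughly along these lines (names illustrative, based on the DESTINATION_PATH and prefix in the snippet above):

gs://bucketname/foldername/my-files-000000000000.gz
gs://bucketname/foldername/my-files-000000000001.gz
gs://bucketname/foldername/my-files-000000000002.gz
...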

My question: can I control the file sizes of the split files? For example, if I have 14 GB of data to export to GCS, it will be split into fourteen 1 GB files. Is there a way to change that 1 GB to some other size (smaller than 1 GB before gzipping)? I have checked the various parameters available on the configuration.extract object (refer to https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract).

Answer

If you specify multiple URI patterns, the data will be sharded between them. So if you used, say, 28 URI patterns, each shard would be about half a GB. You'd end up with a second, zero-size file for each pattern, since this feature is really meant for MR (MapReduce) jobs, but it's one way to accomplish what you want.
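As a minimal sketch of that suggestion, reusing the query_data job body from the snippet in the question: the destinationUris list is simply populated with one single-wildcard pattern per desired shard. The pattern count (28) and the 'my-files' prefix are illustrative assumptions, not values from the original post.

# Sketch: shard the export across multiple wildcard URI patterns.
# Each pattern must still contain exactly one '*' wildcard.
NUM_PATTERNS = 28  # e.g. 14 GB / 28 patterns ~= 0.5 GB per shard (before gzip)
query_data['configuration']['extract']['destinationUris'] = [
    '{}my-files-{:02d}-*.gz'.format(DESTINATION_PATH, i)
    for i in range(NUM_PATTERNS)
]

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()

Note that the per-shard size is only approximate, and the zero-size trailing files mentioned above will still appear for each pattern.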

More info here (see the Multiple Wildcard URIs section): Exporting Data From BigQuery
