Exporting data to GCS from BigQuery - Split file size control
Question
I am currently exporting data from BigQuery to GCS buckets. I am doing this programmatically with the following extract job:
# bigquery_service is an authorized googleapiclient.discovery client
# for the BigQuery v2 REST API.
query_request = bigquery_service.jobs()

DATASET_NAME = '#######'
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'
DESTINATION_PATH = 'gs://bucketname/foldername/'

query_data = {
    'projectId': PROJECT_ID,
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            # Single wildcard URI: BigQuery replaces '*' with a shard number.
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': False,
            'compression': 'GZIP',
        }
    }
}

query_response = query_request.insert(projectId=PROJECT_ID,
                                      body=query_data).execute()
Since there is a constraint that only 1GB per file can be exported to GCS, I used a single wildcard URI (https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple). This splits the output into multiple smaller parts, and each part is gzipped as well.
My question: Can I control the file sizes of the split files? For example, if I export 14GB of data to GCS, it will be split into 14 files of roughly 1GB each. Is there a way to change that 1GB to another size (smaller than 1GB, before gzipping)? I checked the parameters available on the configuration.extract object but found nothing relevant. (Refer: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract)
Answer
If you specify multiple URI patterns, the data will be sharded between them. So if you used, say, 28 URI patterns, each shard would be about half a GB. You'd end up with a second, zero-size file for each pattern, since this feature is really meant for MR jobs, but it's one way to accomplish what you want.
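As a minimal sketch of the approach above: build a list of wildcard patterns and pass it as destinationUris in the extract configuration. The bucket, prefix, and shard count (28) here are illustrative assumptions, not values from the question.

```python
# Build 28 wildcard URI patterns; BigQuery shards the export
# roughly evenly across the patterns, so ~14GB of data yields
# shards of about half a GB each.
NUM_PATTERNS = 28
DESTINATION_PATH = 'gs://bucketname/foldername/'  # assumed bucket/prefix

destination_uris = [
    DESTINATION_PATH + 'my-files-{:d}-*.gz'.format(i)
    for i in range(NUM_PATTERNS)
]

# This list would replace the single-element 'destinationUris'
# value in the extract job configuration shown in the question.
```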
More info here (see the Multiple Wildcard URIs section): Exporting Data From BigQuery