Exporting data to GCS from BigQuery - Split file size control


Question

I am currently exporting data from BigQuery to GCS buckets. I am doing this programmatically using the following query:

query_request = bigquery_service.jobs()

DATASET_NAME = '#######'
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'

DESTINATION_PATH = 'gs://bucketname/foldername/'
query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            # Single wildcard URI: BigQuery splits the export into
            # multiple numbered files matching this pattern.
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': False,
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()

Since there is a constraint that only 1 GB per file can be exported to GCS, I used a single wildcard URI (https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple). This splits the output into multiple smaller files, each of which is then gzipped.

My question: Can I control the file sizes of the split files? For example, if I have 14 GB of data to export to GCS, it will be split into 14 files of 1 GB each. Is there a way to change that 1 GB to another size (smaller than 1 GB before gzipping)? I checked the various parameters available on the configuration.extract object but found nothing for this. (Refer: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract)

Answer

If you specify multiple URI patterns, the data will be sharded between them. So if you used, say, 28 URI patterns, each shard would be about half a GB. You'd end up with a second, zero-size file for each pattern, since this feature is really meant for MR jobs, but it's one way to accomplish what you want.

More info here (see the "Multiple Wildcard URIs" section): Exporting Data From BigQuery
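As a minimal sketch of the approach above: the list of patterns can be generated programmatically and dropped into `destinationUris` in place of the single wildcard URI. The bucket path and the `my-files` prefix are taken from the question; the count of 28 is the answer's example for halving ~1 GB shards of a ~14 GB table.

```python
# Build N wildcard URI patterns so BigQuery shards the export across them.
# Each pattern still contains '*', which BigQuery replaces with a file number.
DESTINATION_PATH = 'gs://bucketname/foldername/'
NUM_PATTERNS = 28  # more patterns -> smaller shards per pattern

destination_uris = [
    '{}my-files-{:03d}-*.gz'.format(DESTINATION_PATH, i)
    for i in range(NUM_PATTERNS)
]
# e.g. destination_uris[0] == 'gs://bucketname/foldername/my-files-000-*.gz'
```

This list would then be assigned to `'destinationUris'` in the extract configuration shown earlier, leaving the rest of the job body unchanged.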

