Google Cloud Storage Joining multiple csv files

Question

I exported a dataset from Google BigQuery to Google Cloud Storage; given the size of the dataset, BigQuery exported it as 99 csv files.

However, now I want to connect to my GCP bucket and perform some analysis with Spark, yet I need to join all 99 files into a single large csv file to run my analysis.

How can I achieve this?

Answer

BigQuery splits the exported data into several files if it is larger than 1 GB. But you can merge these files with the gsutil tool; see the official documentation on object composition to learn how to do this with gsutil.

As BigQuery exports the files with the same prefix, you can use a wildcard * to merge them into one composite object:

gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object

Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.

The downside of this option is that the header row of each .csv file will be included in the composite object. But you can avoid this by modifying the jobConfig to set the print_header parameter to False.

Here is a Python sample code, but you can use any other BigQuery client library:

from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'yourBucket'

project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'

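# The * wildcard lets BigQuery shard large exports into numbered files.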
destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

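# print_header=False omits the header row from every exported shard.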
job_config = bigquery.job.ExtractJobConfig(print_header=False)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US',
    job_config=job_config)  # API request

extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))

Finally, remember to compose an empty .csv containing just the header row, placed first, since the exported shards no longer include headers.
