如何将多个csv文件(不同的架构)加载到bigquery中 [英] How to load multiple csv files (different schemas) into bigquery

查看:46
本文介绍了如何将多个csv文件(不同的架构)加载到bigquery中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有6,500个具有约250个不同模式的csv文件.即这些文件来自F.D.I.C(美国银行监管机构)数据集.它们已上传到Google云端存储桶:

I have 6,500 csv files with ~250 different schema's. i.e. These files are from the F.D.I.C (USA bank regulator) dataset. They have been uploaded to a google cloud storage bucket:

每个财政季度都有约250个不同的csv.一个财务季度内的每个csv具有不同的架构:

Each financial quarter has ~250 different csv's. Each csv, within a financial quarter, has a different schema:

大约有250个独特的架构.该模式在每个财务季度重复一次.csv文件可追溯到1992年的100个财务季度:

There are ~250 unique schemas. The schema's repeat themselves, each financial quarter. The csv files go back 100 financial quarters to 1992:

可以使用通配符上载具有相同架构的多个CSV.例如 gs/path/*.csv .但是,不是从文件名自动生成每个表名.UI需要一个表名作为输入:

Multiple CSV's,with the same schema, can be uploaded using a wild card. e.g. gs/path/*.csv. However each table name is not being auto generated from the file name. The UI requires a table name as an input:

如何将多个具有不同架构的csv文件加载到bigquery中?

How does one load multiple csv files with different schemas into bigquery?

推荐答案

我要自动执行此操作的方法基本上是从给定存储桶(或其子文件夹)读取所有文件,并使用其文件名"(进行假设)"作为要提取的目标表名.方法如下:

The way I would go about automating this is basically reading all the files from a given bucket (or its subfolder) and (making an assumption) using their "filename" to be the target tablename to ingest. Here is how:

gsutil ls gs://mybucket/subfolder/*.csv | xargs -I{} echo {} | awk '{n=split($1,A,"/"); q=split(A[n],B,"."); print "mydataset."B[1]" "$0}' | xargs -I{} sh -c 'bq --location=US load --replace=false --autodetect --source_format=CSV {}'

请确保将 location mydataset 替换为所需的值.另外,请注意以下假设:

Make sure to replace location, mydataset with your desired values. Also, please take note of the following assumptions:

  • 假定每个CSV的第一行都是标题,因此被视为列名.
  • 我们正在使用-replace = false 标志进行编写,这意味着数据将在您每次运行命令时附加.如果您想改写,只需将其设置为 true ,则每次运行都会覆盖所有表的数据.
  • CSV文件名( .csv 之前的部分用作表名.您可以修改awk脚本以将其更改为其他任何替代名称.
  • First row of each CSV is assumed to be the header, and thus is treated as column names.
  • We are writing with --replace=false flag, meaning data will be appended everytime you run the command. If you want to overwrite instead, just turn it to true and all tables' data will be over-written on each run.
  • CSV filenames (part before .csv is used as a tablename. You can modify the awk script to change it to any other alternative.

这篇关于如何将多个csv文件(不同的架构)加载到bigquery中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆