How to load multiple csv files (different schemas) into bigquery
Question
I have 6,500 csv files with ~250 different schemas. The files are from the F.D.I.C. (USA bank regulator) dataset and have been uploaded to a Google Cloud Storage bucket:
Each financial quarter has ~250 different csv files, and each csv within a quarter has a different schema:
There are ~250 unique schemas, and they repeat each financial quarter. The csv files go back 100 financial quarters, to 1992:
Multiple csv files with the same schema can be loaded using a wildcard, e.g. `gs://path/*.csv`. However, the table name is not auto-generated from the file name; the UI requires a table name as input:
How does one load multiple csv files with different schemas into bigquery?
Answer
The way I would go about automating this is to read all the files from a given bucket (or its subfolder) and, making an assumption, use each file's name as the target table name to ingest into. Here is how:
gsutil ls gs://mybucket/subfolder/*.csv \
  | awk '{n=split($1,A,"/"); q=split(A[n],B,"."); print "mydataset."B[1]" "$0}' \
  | xargs -I{} sh -c 'bq --location=US load --replace=false --autodetect --source_format=CSV {}'
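The heart of the pipeline is the awk step, which maps each object path to a `dataset.table` pair by taking the last path component and stripping the extension. You can verify the mapping locally before touching GCS; the bucket path and `mydataset` below are placeholders:

```shell
# Show which table name the awk script derives from a sample object path.
# "mydataset" and the gs:// path are placeholders, not real resources.
echo "gs://mybucket/subfolder/balance_sheet.csv" \
  | awk '{n=split($1,A,"/"); q=split(A[n],B,"."); print "mydataset."B[1]" "$0}'
# → mydataset.balance_sheet gs://mybucket/subfolder/balance_sheet.csv
```

Each output line is then handed to `bq load` as "destination_table source_uri", which is why the print statement joins the two with a space.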
Make sure to replace `location` and `mydataset` with your desired values. Also, please take note of the following assumptions:
- The first row of each CSV is assumed to be the header, and is therefore treated as the column names.
- We are writing with the `--replace=false` flag, meaning data will be appended every time you run the command. If you want to overwrite instead, set it to `true`, and all tables' data will be overwritten on each run.
- The CSV filename (the part before `.csv`) is used as the table name. You can modify the awk script to change this to any other alternative.