Most reliable format for large BigQuery load jobs


Problem Description

I have a 100 GB table that I'm trying to load into Google BigQuery. It is stored as a single 100 GB Avro file on GCS.

Currently, my bq load job is failing with an unhelpful error message:

UDF worker timed out during execution.; Unexpected abort triggered for
worker avro-worker-156907: request_timeout
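
For context, the load itself is a plain bq invocation; a minimal sketch, where mydataset.mytable and the GCS path are placeholders for the real ones (Avro is self-describing, so no schema argument is needed):

bq load \
    --source_format=AVRO \
    mydataset.mytable \
    gs://my-bucket/table.avro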

I'm thinking of trying a different format. I understand that BigQuery supports several formats (Avro, JSON, CSV, Parquet, etc.) and that in principle one can load large datasets in any of these formats.

However, I was wondering whether anyone here has experience with which of these formats is most reliable / least prone to quirks in practice when loading into BigQuery?
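
For instance, retrying the load with the same data re-exported as Parquet would only change the source format flag (again a sketch; the table name and wildcard path are placeholders):

bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    'gs://my-bucket/table-*.parquet'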

Recommended Answer

I would probably solve this by following these steps:

  1. Creating a ton of small files in CSV format (see the sketch after this list)
  2. Sending the files to GCS.
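
For step 1, a minimal sketch of the chunking using GNU split, assuming the data is already available locally as one large CSV (big.csv and the chunk_ prefix are placeholder names):

# Split into files of at most 1 GB each, without breaking rows across files.
split -C 1G --numeric-suffixes --additional-suffix=.csv big.csv chunk_

Having many smaller files also lets gsutil -m parallelize the upload in the next step.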

Command to copy the files to GCS:

gsutil -m cp <local folder>/* gs://<bucket name>

The gsutil -m option performs the copy in parallel (multi-threaded/multi-processing).

After that, I'll move the data from GCS to BQ using the Cloud Dataflow default template (link). (Remember that with a default template, you don't need to write any code.)

Here is an example of invoking the Dataflow template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS

