BigQuery script failing for large file


Problem Description

I am trying to load a JSON file into Google BigQuery using the script at https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/api/load_data_by_post.py with very little modification. I added

,chunksize=10*1024*1024, resumable=True))

to MediaFileUpload.
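
For context, here is a minimal sketch of how that modification fits into the linked sample's upload flow; the project, dataset, table, and file names below are hypothetical placeholders, not the sample's exact code:

    # Sketch, assuming the googleapiclient flow used by the linked sample.
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    bigquery = build('bigquery', 'v2')  # assumes application default credentials

    job_body = {'configuration': {'load': {
        'sourceFormat': 'NEWLINE_DELIMITED_JSON',
        'autodetect': True,  # or supply an explicit schema
        'destinationTable': {'projectId': 'my-project',   # placeholders
                             'datasetId': 'my_dataset',
                             'tableId': 'my_table'},
    }}}

    media = MediaFileUpload(
        'data.json',                   # the JSON file being loaded
        mimetype='application/octet-stream',
        chunksize=10 * 1024 * 1024,    # the added 10 MB chunk size
        resumable=True)                # the added resumable flag

    insert_request = bigquery.jobs().insert(
        projectId='my-project', body=job_body, media_body=media)
    job = insert_request.execute()     # the call that fails on the 140 GB file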

The script works fine for a sample file with a few million records. The actual file is about 140 GB with approx 200,000,000 records. `insert_request.execute()` always fails with

socket.error: `[Errno 32] Broken pipe` 

after half an hour or so. How can this be fixed? Each row is less than 1 KB, so it shouldn't be a quota issue.

Solution

When handling large files, don't use streaming inserts; use a batch load. Streaming will easily handle up to 100,000 rows per second, which is great for streaming, but not for loading large files.

The linked sample code is doing the right thing (batch load instead of streaming), so what we see is a different problem: the sample tries to load all of this data straight into BigQuery, but the upload-through-POST step fails.

Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
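
A hedged sketch of that approach, reusing the same googleapiclient interface as the sample; the GCS URI and table names are placeholders. After copying the file to GCS (e.g. with gsutil cp), the load job only carries a small JSON body, so no large POST upload is involved:

    # Sketch: load from GCS instead of POSTing the bytes directly.
    from googleapiclient.discovery import build

    bigquery = build('bigquery', 'v2')

    job_body = {'configuration': {'load': {
        'sourceUris': ['gs://my-bucket/data-*.json'],  # hypothetical GCS path
        'sourceFormat': 'NEWLINE_DELIMITED_JSON',
        'autodetect': True,
        'destinationTable': {'projectId': 'my-project',
                             'datasetId': 'my_dataset',
                             'tableId': 'my_table'},
    }}}

    # No media_body here: BigQuery reads the data from GCS itself.
    job = bigquery.jobs().insert(projectId='my-project', body=job_body).execute()

A single * wildcard in the source URI also lets you split the 140 GB file into many smaller objects and load them all in one job.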

Update: After talking to the engineering team, POST should work if you try a smaller chunksize.
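
For example, a hedged variant of the earlier MediaFileUpload call with a 1 MB chunk size (the thread doesn't say which value works, so the number here is a guess):

    from googleapiclient.http import MediaFileUpload

    media = MediaFileUpload(
        'data.json',
        mimetype='application/octet-stream',
        chunksize=1024 * 1024,  # 1 MB chunks instead of 10 MB
        resumable=True)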
