Insert large amounts of data into BigQuery via the bigquery-python library


Problem description

I have large CSV and Excel files. I read them and dynamically create the required CREATE TABLE script based on the fields and types each file contains, and then insert the data into the created table.

I have read this and understood that I should send the data with jobs.insert() instead of tabledata.insertAll() for large amounts of data.
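
(For reference, here is roughly what such a jobs.insert() load job looks like when issued through the official google-cloud-bigquery client rather than the bigquery-python wrapper; this is only a sketch, and the project, dataset, table, and file names are placeholders.)

# Sketch only: batch-load a local CSV as a load job (jobs.insert under the hood)
# using the official google-cloud-bigquery client. All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

with open("large_file.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "my-project.my_dataset.my_table",
        job_config=job_config,
    )

load_job.result()  # wait for the load job to finish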

This is how I call it (it works for smaller files, but not for large ones):

result = client.push_rows(datasetname, table_name, insertObject)  # insertObject is a list of dictionaries

When I use the library's push_rows, it gives this error on Windows:

[Errno 10054] An existing connection was forcibly closed by the remote host

and this one on Ubuntu:

[Errno 32] Broken pipe

So when I went through the BigQuery-Python code, I found that it uses table_data.insertAll().

How can I do this with this library? I know we can upload through Google Cloud Storage, but I need a direct upload method with this library.

Solution

When handling large files, don't use streaming; use batch loading instead. Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but not for loading large files.

The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: the sample code tries to load all of this data straight into BigQuery, but the upload-through-POST step fails. gsutil has a more robust upload algorithm than a plain POST.

Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read the files from GCS.
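
As a rough illustration of that workflow, the sketch below stages the CSV in a GCS bucket and then starts a load job that reads it from there. It uses the official google-cloud-storage and google-cloud-bigquery clients rather than the bigquery-python wrapper from the question, and the bucket, dataset, and table names are placeholders.

# Sketch: stage a local CSV in Google Cloud Storage, then load it into BigQuery.
# Uses the official google-cloud clients; bucket/table names are placeholders.
from google.cloud import bigquery, storage

# 1. Upload the file to GCS; the client uses a chunked, resumable upload for large files.
storage_client = storage.Client()
bucket = storage_client.bucket("my-staging-bucket")
bucket.blob("uploads/large_file.csv").upload_from_filename("large_file.csv")

# 2. Tell BigQuery to read the staged file from GCS.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = bq_client.load_table_from_uri(
    "gs://my-staging-bucket/uploads/large_file.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # blocks until the load job completes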

See also: BigQuery script failing for large file.
