Read from BigQuery into Spark in an efficient way?


Problem description

When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads that data into Spark in parallel. When reading a big table, the copy stage takes a very long time. Is there a more efficient way to read data from BigQuery into Spark?
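For context, a minimal sketch of what such a read typically looks like with the BigQuery connector for Hadoop on a Dataproc cluster; the project, dataset, table, and bucket names below are placeholders, not values from the question:

```python
# Minimal sketch: reading a BigQuery table into a Spark RDD via the
# BigQuery connector for Hadoop. Assumes a Dataproc cluster with the
# connector installed; all IDs and paths are placeholders.
from pyspark import SparkContext

sc = SparkContext()

conf = {
    # Project that runs (and is billed for) the export job.
    "mapred.bq.project.id": "my-project",
    # GCS bucket and path used for the intermediate export -- this is the
    # "copy to GCS" stage described above.
    "mapred.bq.gcs.bucket": "my-temp-bucket",
    "mapred.bq.temp.gcs.path": "gs://my-temp-bucket/bq-export-tmp",
    # Source table to read.
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_dataset",
    "mapred.bq.input.table.id": "my_table",
}

# Each record arrives as (row index, JSON string of the row).
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)

print(table_rdd.take(5))
```

The export to `mapred.bq.temp.gcs.path` happens before any Spark tasks read data, which is the slow stage observed in the question.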

Another question: reading from BigQuery consists of two stages (copying to GCS, then reading from GCS in parallel). Is the copy stage affected by the Spark cluster size, or does it take a fixed amount of time?

Recommended answer

Maybe a Googler will correct me, but AFAIK that's the only way. This is because under the hood it also uses the BigQuery connector for Hadoop, which, according to the docs:

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job.

As a side note, this is also true when using Dataflow: it too first exports the BigQuery table(s) to GCS and then reads them in parallel.

As for whether the copy stage (which is essentially a BigQuery export job) is influenced by your Spark cluster size, or whether it takes a fixed time: it is not affected by the cluster size. BigQuery export jobs are nondeterministic in duration, and BigQuery uses its own resources to export to GCS, i.e. not your Spark cluster.
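To make that separation concrete, here is a minimal sketch of the same export step run as a standalone BigQuery extract job with the google-cloud-bigquery Python client (table and bucket names are placeholders). The job executes entirely on BigQuery's side, regardless of how many Spark workers you have; Spark only reads the resulting files from GCS afterwards.

```python
# Minimal sketch: exporting a BigQuery table to GCS as an extract job.
# Assumes the google-cloud-bigquery client library; all names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

source_table = "my-project.my_dataset.my_table"
destination_uri = "gs://my-temp-bucket/bq-export/my_table-*.json"

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)

# The extract job runs on BigQuery's own resources, not on the Spark cluster.
extract_job = client.extract_table(
    source_table, destination_uri, job_config=job_config
)
extract_job.result()  # wait for the export to finish
```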

