BigQueryIO Read vs fromQuery

Question

In a Dataflow/Apache Beam program, I am reading a table whose data is growing exponentially, and I want to improve the read performance.

BigQueryIO.Read.from("projectid:dataset.tablename")

BigQueryIO.Read.fromQuery("SELECT A, B FROM [projectid:dataset.tablename]")

Will read performance improve if I select only the required columns (the second form above) rather than reading the entire table?

I am aware that selecting fewer columns reduces cost, but I would like to know how it affects read performance.

Recommended answer

You're right that selecting only the columns you need costs less than a query that references every column. Also note that when you use from() instead of fromQuery(), you don't pay for any table scans in BigQuery, which you may not have been aware of.

Under the hood, whenever Dataflow reads from BigQuery, it actually calls BigQuery's export API and instructs it to dump the table(s) to GCS as sharded files. Dataflow then reads these files in parallel into your pipeline. It does not read "directly" from BigQuery.

So yes, this might improve performance: fewer columns means less data to export to GCS under the hood and less data to read into your pipeline.

However, I'd also consider using partitioned tables, and possibly clustering them as well. WHERE clauses can further reduce the amount of data to be exported and read.
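To make the column-pruning and WHERE-clause advice concrete, here is a minimal sketch of a helper that builds such a query string. Everything in it is an assumption for illustration: the class name PrunedQuery, the standard SQL table reference (the question uses the legacy [projectid:dataset.tablename] form), and the _PARTITIONTIME filter, which only applies if the table is ingestion-time partitioned.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Beam or the BigQuery client library.
final class PrunedQuery {

    // Builds a standard SQL query that selects only the given columns and,
    // if partitionDate is non-null, restricts the read to one daily partition.
    // Inputs are concatenated as-is, so they must be trusted constants,
    // never user-supplied text.
    static String build(String table, List<String> columns, String partitionDate) {
        if (columns == null || columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        StringBuilder sql = new StringBuilder();
        sql.append("SELECT ").append(String.join(", ", columns));
        sql.append(" FROM `").append(table).append("`");
        if (partitionDate != null) {
            // Assumes an ingestion-time partitioned table (_PARTITIONTIME pseudo-column).
            sql.append(" WHERE DATE(_PARTITIONTIME) = '").append(partitionDate).append("'");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Table and column names are taken from the question; the date is made up.
        System.out.println(
            build("projectid.dataset.tablename", Arrays.asList("A", "B"), "2019-01-01"));
    }
}
```

The resulting string is what you would hand to fromQuery(). In Beam 2.x, as far as I know, that would look like BigQueryIO.readTableRows().fromQuery(sql).usingStandardSql(), since fromQuery() defaults to legacy SQL. Whether the partition filter helps in practice depends on how the table is partitioned.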
