Is there a difference in `BigQueryIO` when you use `fromTable` vs `fromQuery("SELECT * ...")` in Dataflow?
When you need to read all the data from one or more BigQuery tables in a Dataflow job, there are, I would say, two approaches. The first is to use `BigQueryIO` with `from`, which reads the table in question; the second is to use `fromQuery`, where you specify a query that reads all the data from the same table. So my question is:
- Is there any cost or performance benefit to using one over the other?
I haven't found anything in the docs about this, but I would really like to know. I imagine that a direct table read with `from` may be faster, since you don't need to run a query that scans the data, which makes it more similar to the preview functionality in the BigQuery UI. If that is true it might also be much cheaper, but it would also make sense if they both cost the same.
So in short, what is the difference between:
BigQueryIO.read(...).from(tableName)
And
BigQueryIO.read(...).fromQuery("SELECT * FROM " + tableName)
`from` is both cheaper and faster than `fromQuery("SELECT * FROM ...")`.
`from` directly exports the table, and exporting data is free in BigQuery. `fromQuery("SELECT * FROM ...")` will first scan the entire table ($5/TB) and then export the result.
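To put rough numbers on the cost difference, here is a minimal sketch of the arithmetic. The class and method names are hypothetical (they are not part of Beam or any BigQuery library), the $5/TB rate is simply the figure quoted above and should be checked against current pricing, and the 2.5 TB table size is made up for illustration.

```java
// Hypothetical helper for illustration only; not part of Beam or the BigQuery client.
final class ReadCostSketch {
    // Query scan rate quoted in the answer above (an assumption; verify current pricing).
    static final double USD_PER_TB_SCANNED = 5.0;

    // from(table): the table is exported directly, which the answer describes as free,
    // so the cost does not depend on the table size.
    static double exportCostUsd(double tableSizeTb) {
        return 0.0;
    }

    // fromQuery("SELECT * FROM ..."): the whole table is scanned first,
    // so the scan is billed on the full table size before the result is exported.
    static double queryCostUsd(double tableSizeTb) {
        return tableSizeTb * USD_PER_TB_SCANNED;
    }

    public static void main(String[] args) {
        double tableSizeTb = 2.5; // made-up table size
        System.out.println("from:      $" + exportCostUsd(tableSizeTb));
        System.out.println("fromQuery: $" + queryCostUsd(tableSizeTb));
    }
}
```

Under these assumptions the gap grows linearly with table size, so for a pipeline that re-reads a large table on every run, the query scan charge is likely to dominate the read cost.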