Get TableSchema from BigQuery result PCollection<TableRow>

Question

When I run a query in the BigQuery Web UI, the results are displayed in a table where both the name and the type of each field are known (even when a field is the result of a COUNT(), AVG(), ... operation, its type is known, of course). The results can then be exported directly as a table/JSON/CSV.

My question is, when I retrieve query results in my Java project, e.g. with a query:

// A Java string literal cannot span lines, so the query is concatenated.
String query = "SELECT nationality, COUNT(DISTINCT personID) AS population "
             + "FROM Dataset.Table "
             + "GROUP BY nationality";

PCollection<TableRow> result = p.apply(BigQueryIO.Read.fromQuery(query));

... is it possible to obtain the schema of TableRow in result PCollection, without explicitly defining it? I think it must be possible, since it's possible with the same query when using BigQuery Web UI. But I can't figure out how to do it ...

TableSchema schema =  // function of PCollection<TableRow> result ?

result.apply(BigQueryIO.Write
                .named("Write Results Table")
                .to(getTableReference(tableName))
                .withSchema(schema));

That way query results could be always automatically exported/saved into a new table (only the table name then needs to be explicitly provided).

Any ideas? Any help would be appreciated :)

Recommended answer

Unfortunately, the Dataflow SDK doesn't expose the schema returned by BigQuery via Dataflow's BigQueryIO API. There's no "good" workaround within the Dataflow API alone.

Defining a schema manually is one workaround.
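
As a rough illustration of the manual route, the sketch below writes down the two result columns of the example query as name/type pairs and renders them in the JSON shape of BigQuery's TableSchema resource. The class and method names are hypothetical, the STRING type for nationality is an assumption about the source table, and the code uses only the Java standard library so that it runs without the SDK on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of the Dataflow SDK.
final class ManualSchemaSketch {

    // Column name -> BigQuery type for the example query's result.
    // STRING for nationality is an assumption; COUNT(...) yields INTEGER.
    static Map<String, String> queryResultFields() {
        Map<String, String> fields = new LinkedHashMap<>(); // preserves column order
        fields.put("nationality", "STRING");
        fields.put("population", "INTEGER");
        return fields;
    }

    // Renders the fields as {"fields":[{"name":...,"type":...}, ...]}.
    // Names and types are assumed to need no JSON escaping.
    static String toSchemaJson(Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner(",", "{\"fields\":[", "]}");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            joiner.add("{\"name\":\"" + field.getKey()
                    + "\",\"type\":\"" + field.getValue() + "\"}");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSchemaJson(queryResultFields()));
    }
}
```

With the BigQuery API client library available, the same pairs would map onto new TableFieldSchema().setName(...).setType(...) objects collected with new TableSchema().setFields(...), which is the type that withSchema(schema) in the question expects.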

Alternatively, you could make a separate query to BigQuery directly via the jobs.query API method at pipeline construction time; the schema in its response can then be passed to the BigQueryIO.Write transform. This may incur additional cost, but that can probably be mitigated by altering the query slightly to reduce the amount of data processed. Correctness of the output is not relevant, since you'd be storing the schema only.
