How to best process large query results written to intermediate table in App Engine


Problem Description

We are running large query jobs where we hit the 128MB response size limit and BigQuery raises the "Response too large to return. Consider setting allowLargeResults to true in your job configuration" error.

We are opting for the allowLargeResults approach to keep the already complex SQL unchanged (instead of chunking things at this level). The question is what is the best way to process the results written to the intermediate table:

  • Export the table to GCS, then queue tasks that process chunks of the response file using offsets into the GCS file. This introduces latency from GCS, GCS file maintenance (e.g. cleaning up files), and another point of failure (HTTP errors, timeouts, etc.).

  • Query chunks from the intermediate table, also using queued tasks. The question here is what is the best way to chunk the rows (is there an efficient way to do this, e.g. an internal row number we can refer to?). We would probably end up scanning the entire table for each chunk, so this seems more costly than the export-to-GCS option.
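
For context, the allowLargeResults approach referred to above amounts to submitting the query with an explicit destination table that then serves as the intermediate table. A minimal sketch of the v2 job configuration body, assuming placeholder project/dataset/table names (the body would be passed to `jobs.insert` via the API client):

```python
def build_large_results_job(sql, project_id, dataset_id, table_id):
    """Build a BigQuery v2 query-job body that allows large results
    by writing them to an explicit destination (intermediate) table.

    All identifiers here are placeholders for illustration.
    """
    return {
        "configuration": {
            "query": {
                "query": sql,
                "allowLargeResults": True,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # Overwrite the intermediate table on each run.
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }

job = build_large_results_job(
    "SELECT * FROM big_dataset.big_table", "my-project", "tmp", "query_results")
```

The destination table named here is the intermediate table that both options above would then read from.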

Any experience in this area and/or recommendations?

Note that we are running in Google App Engine (Python).

Thanks!

Recommended Answer

I understand that https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list will let you read chunks of a table without performing a query (and thus without incurring data processing charges).

This lets you read the results of a query in parallel: all query results are written to a temporary table, whose id you can pass to this method while supplying different ranges (with startIndex and maxResults).

