How to best process large query results written to intermediate table in App Engine


Question

We are running large query jobs where we hit the 128M response size and BigQuery raises the "Response too large to return. Consider setting allowLargeResults to true in your job configuration" error.
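In the BigQuery v2 API, setting allowLargeResults also requires naming an explicit destination table in the job configuration. A minimal sketch of the jobs.insert body, with placeholder project/dataset/table identifiers:

```python
def build_large_results_config(query, project_id, dataset_id, table_id):
    """Build a BigQuery v2 jobs.insert body that allows large results.

    allowLargeResults must be paired with an explicit destinationTable;
    the identifiers passed in here are placeholders for illustration.
    """
    return {
        "configuration": {
            "query": {
                "query": query,
                "allowLargeResults": True,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # Overwrite the intermediate table on each run.
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }
```

Passing this body to jobs.insert leaves the full result set in the intermediate table, which the options below then have to process.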

We are opting for the allowLargeResults approach to keep the already complex SQL unchanged (instead of chunking things at this level). The question is what is the best way to process the results written to the intermediate table:

  • Export the table to GCS, then queue tasks that process chunks of the response file using offsets into the GCS file. This introduces latency from GCS, GCS file maintenance (e.g. cleaning up files), and another point of failure (HTTP errors/timeouts, etc.).

  • Query chunks from the intermediate table, also using queued tasks. The question here is what the best way to chunk the rows is (is there an efficient way to do this, e.g. an internal row number we can refer to?). We would probably end up scanning the entire table for each chunk, so this seems more costly than the export-to-GCS option.
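For the queued-task variant, the fan-out can be driven by row offsets rather than by re-querying. A minimal sketch, assuming a hypothetical /process_chunk task handler (the handler URL and chunk size are made up for illustration):

```python
def chunk_ranges(total_rows, chunk_size):
    """Split [0, total_rows) into (start_index, max_results) pairs."""
    return [(start, min(chunk_size, total_rows - start))
            for start in range(0, total_rows, chunk_size)]


def enqueue_chunk_tasks(table_ref, total_rows, chunk_size=10000):
    """Enqueue one App Engine task per chunk of the intermediate table."""
    # Imported inside the function: only available on App Engine.
    from google.appengine.api import taskqueue
    for start, count in chunk_ranges(total_rows, chunk_size):
        taskqueue.add(url="/process_chunk", params={
            "table": table_ref,
            "start_index": start,
            "max_results": count,
        })
```

Each task then only needs to fetch its own slice, so no chunk has to scan the whole table.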

Any experience in this area and/or recommendations?

Note that we are running on Google App Engine (Python).

Thanks!

Answer

I understand that https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list will let you read chunks of a table without performing a query (and therefore without incurring data processing charges).

This also lets you read the results of a query in parallel: every query writes to a temporary table, whose ID you can pass to this method while supplying different ranges (with startIndex and maxResults).
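A sketch of one worker reading its slice via tabledata.list (credential setup and error handling are elided, and the helper names are my own):

```python
def read_chunk(project_id, dataset_id, table_id, start_index, max_results):
    """Read one slice of the destination table with tabledata.list.

    No query runs here, so no query-processing charges are incurred.
    Credential setup is elided for brevity.
    """
    # Imported inside the function: requires google-api-python-client.
    from googleapiclient.discovery import build
    bq = build("bigquery", "v2")
    resp = bq.tabledata().list(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id,
        startIndex=start_index,
        maxResults=max_results,
    ).execute()
    return resp.get("rows", [])


def rows_to_values(rows):
    """Flatten tabledata.list rows ([{'f': [{'v': ...}, ...]}]) to plain lists."""
    return [[cell["v"] for cell in row["f"]] for row in rows]
```

Since each worker asks only for its own startIndex/maxResults window, the chunks can be fetched concurrently by independent tasks.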
