How to fragment a BigQuery response into 10,000 rows per request?


Question

I have the BigQuery query 'SELECT visitorId, totals.visits FROM [12123333.ga_sessions_20160602]', which returns 500K rows in one request.

But I want to fetch rows 1 to 10,000 in one request, then rows 10,001 to 20,000 in the next request, and so on.

Thanks.

Answer

One option would be to write the result of your query into a destination table and then use the Tabledata: list API to retrieve data from that table in a paged manner, either using maxResults and pageToken to retrieve page by page, or maxResults and startIndex to retrieve a specific set of rows.
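A minimal sketch of the paging arithmetic behind the maxResults/startIndex approach. The helper name `page_params` is hypothetical (it is not part of any BigQuery client library); it just computes the parameter pairs you would pass to successive tabledata.list calls:

```python
# Compute the (startIndex, maxResults) pairs for walking a 500K-row
# table in 10,000-row pages via tabledata.list.
# page_params is an illustrative helper, not a BigQuery API.

def page_params(total_rows, page_size):
    """Yield (start_index, max_results) for each tabledata.list call."""
    for start in range(0, total_rows, page_size):
        yield start, min(page_size, total_rows - start)

pages = list(page_params(500_000, 10_000))
print(pages[:3])   # -> [(0, 10000), (10000, 10000), (20000, 10000)]
print(len(pages))  # -> 50
```

Each pair maps directly onto one API request, so the 500K-row result becomes 50 requests of 10,000 rows each.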

Another option would be to add a row number to your query (something like below)

SELECT visitorId , totals.visits,  
  ROW_NUMBER() OVER() as num
FROM [12123333.ga_sessions_20160602]

while still writing the result into a destination temp table, and then retrieve data from that table using the new num field for grouping, for example as num % 10000 = {group_number}. Or you can use INTEGER(num / 10000) = {group_number}, whichever you like more

SELECT visitorId , totals.visits 
FROM tempTable
WHERE num % 10000 = 0 

and the next group would use

WHERE num % 10000 = 1 

and so on ...
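Note that the two grouping schemes slice the table differently, which a small pure-Python sketch makes visible (row numbers 1..20 and a group size of 5 stand in for 500K rows and 10,000):

```python
# Compare the two grouping predicates from the answer on toy data.
rows = range(1, 21)     # values produced by ROW_NUMBER()
group_size = 5          # stand-in for 10,000

# num % group_size = g selects every group_size-th row (strided groups):
strided = [n for n in rows if n % group_size == 0]
print(strided)          # -> [5, 10, 15, 20]

# INTEGER(num / group_size) = g selects a contiguous block instead:
contiguous = [n for n in rows if n // group_size == 0]
print(contiguous)       # -> [1, 2, 3, 4]
```

So the modulo form spreads each group across the whole table, while the integer-division form gives contiguous pages; for "rows 1 to 10,000, then 10,001 to 20,000" semantics, the integer-division variant is the closer match.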

Please note: the second option uses the expensive (execution-wise, not billing-wise) ROW_NUMBER() function, which requires all data for each partition (in this case there is only one partition: all rows) to be on the same node, so whether it works depends on the number of rows. For your specific example with just 500K rows it will work, but if you extend it to a table with millions and millions of rows it might not (depending on how much data you output in each row and the number of rows).

One more note:
- In the first option you pay only once, when you generate the result and save it into the temp table. After that it is free, in the sense that the Tabledata.list API does not run a BigQuery query per se but reads directly from the underlying data.
- In the second option you pay both when you generate the temp table and each time you retrieve/query yet another group, because those are all BigQuery queries. Moreover, each time you get the data for a specific group you are charged for scanning the whole temp table, so in your case that is an extra 50 scans.

This makes (in your case) the first option around 51 times cheaper than the second one :o)
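The "around 51 times" figure follows directly from the numbers above, assuming each group query in the second option scans the whole 500K-row temp table:

```python
# Back-of-the-envelope check of the cost comparison.
total_rows = 500_000
group_size = 10_000

groups = total_rows // group_size    # 50 group queries
option1_scans = 1                    # one query to build the temp table
option2_scans = 1 + groups           # temp table + one full scan per group

print(groups)                        # -> 50
print(option2_scans)                 # -> 51
```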

