Partitioning by date?

Question

We are experimenting with BigQuery to analyze user data generated by our software application.

Our working table consists of hundreds of millions of rows, each representing a unique user "session". Each row contains a timestamp, a UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2GB of data (~10M rows) per day.

Every so often we may run queries against the entire dataset (about 2 months' worth right now, and growing). However, typical queries will span just a single day, week, or month. We're finding that as our table grows, our single-day query becomes more and more expensive (as we would expect given BigQuery's architecture).

What is the best way to query subsets of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) and then query them together in a union:

SELECT foo FROM mytable_2012-09-01, mytable_2012-09-02, mytable_2012-09-03;

Is there a better way than this?

Answer

Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.

To be clearer: BigQuery does not have a concept of indexes (by design), so a query reads every row of each table it references. Sharding data into separate tables is therefore a useful strategy for keeping queries as economically efficient as possible.
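
As a rough illustration of the sharding approach, here is a small Python sketch that generates the daily table names for a date range and assembles a query in the same comma-separated style as the one in the question. The helper names `shard_names` and `build_query` are hypothetical, and the `mytable_` prefix and date suffix format are simply copied from the question; adjust them to whatever naming your tables actually use. It only builds a query string and does not talk to BigQuery.

```python
from datetime import date, timedelta


def shard_names(prefix, start, end, suffix_format="%Y-%m-%d"):
    """Return one table name per day from start to end, inclusive.

    The suffix format mirrors the naming used in the question and is an
    assumption; change it to match your own tables.
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    return [
        prefix + (start + timedelta(days=i)).strftime(suffix_format)
        for i in range(days)
    ]


def build_query(columns, tables):
    """Build a query over several tables, comma-separated as in the question."""
    return "SELECT {} FROM {};".format(", ".join(columns), ", ".join(tables))


# Query only the three daily shards that cover the period of interest.
tables = shard_names("mytable_", date(2012, 9, 1), date(2012, 9, 3))
query = build_query(["foo"], tables)
print(query)
```

Because each daily shard is a separate table, a single-day query built this way touches only that day's data instead of the whole two-month history.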

On the flip side, if you are worried about accumulating too many tables, another useful feature is to set an expirationTime on each table, after which the table will be deleted and its storage reclaimed. Otherwise tables persist indefinitely.
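
To make the expiration idea concrete, here is a minimal Python sketch that computes an expiration value for a daily shard and wraps it in a dictionary that could serve as the body of a table update. It assumes that expirationTime is expressed as milliseconds since the Unix epoch, sent as a string; check the table resource documentation before relying on that. The helper name `expiration_body` and the 90-day retention period are hypothetical, and the sketch does not call the BigQuery API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expiration_body(created, keep_days):
    """Build a body that expires a table keep_days after it was created.

    Assumption: expirationTime is milliseconds since the Unix epoch,
    serialized as a string.
    """
    expires = created + timedelta(days=keep_days)
    # Integer division of timedeltas avoids floating-point rounding.
    millis = (expires - EPOCH) // timedelta(milliseconds=1)
    return {"expirationTime": str(millis)}


# A shard created on 2012-09-01 (UTC), kept for a hypothetical 90 days.
created = datetime(2012, 9, 1, tzinfo=timezone.utc)
body = expiration_body(created, keep_days=90)
print(body)
```

Setting the expiration when each shard is created means old shards clean themselves up, so the number of tables stays bounded by the retention period.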
