The order of records in a regularly updated BigQuery database

Question

I am going to be maintaining a local copy of a database on BigQuery, using the API and tabledata:list. The database is not my own; its maintainers update it regularly by appending new data (say, every hour).

  1. First, can I assume that when this data is appended, it will definitely be added to the end of the database?

  2. Now, let's assume that the database currently has 1,000,000 rows and I am downloading all of them by paging through tabledata:list. Let's also assume that the database is updated partway through (with 10,000 rows). By using the page tokens, can I be assured that I will download only the 1,000,000 rows that were present when I started, in the order they have in the database?

  3. Finally, let's say that I come to update my copy. If I initiate tabledata:list with a startIndex of 1,000,000 and a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?

I suppose all these questions boil down to whether BigQuery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.

There is a column whose values are unique, and I can run a simple select count(1) from table to get the length of the table, so I can of course check that my local copy is complete by comparing its length with that of the remote. However, if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy: the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.

Recommended answer

  1. When you append data, we append to the end of the table data list. However, BigQuery may periodically coalesce data, and coalescing does not respect ordering. We have been discussing ways to preserve the ordering, or at least to provide a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.

  2. If you use page tokens, you are assured of a stable listing. If the table gets updated while you are paging through the data, you'll still see only the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.

  3. This should work as long as no coalesce has occurred since the table was updated.
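The paging behaviour described above can be sketched roughly as follows. This is only an illustration, not the official client: `fetch_page` is a hypothetical callable standing in for one tabledata:list request, assumed to return a dict with a `rows` list and an optional `pageToken`, and `make_fake_table` is an in-memory stand-in so the loop can be run without network access. The loop passes `startIndex` only on the first request and follows the page token afterwards.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical signature: fetch_page(start_index, max_results, page_token)
# returns a dict shaped like a tabledata:list response ("rows", "pageToken").
FetchPage = Callable[[Optional[int], int, Optional[str]], Dict]


def list_rows(fetch_page: FetchPage, max_results: int = 1000,
              start_index: Optional[int] = None) -> Iterator[List[dict]]:
    """Yield pages of rows, following pageToken until none is returned."""
    page_token = None
    while True:
        # startIndex is sent only with the first request; later requests
        # continue from the page token of the previous response.
        first_request = page_token is None
        response = fetch_page(start_index if first_request else None,
                              max_results, page_token)
        rows = response.get("rows", [])
        if rows:
            yield rows
        page_token = response.get("pageToken")
        if not page_token:
            break


def make_fake_table(rows: List[dict]) -> FetchPage:
    """In-memory stand-in for the API (an assumption, for illustration only)."""
    snapshot = list(rows)  # copy, so later appends to `rows` are not visible

    def fetch_page(start_index, max_results, page_token):
        offset = int(page_token) if page_token else (start_index or 0)
        response = {"rows": snapshot[offset:offset + max_results]}
        if offset + max_results < len(snapshot):
            response["pageToken"] = str(offset + max_results)
        return response

    return fetch_page


if __name__ == "__main__":
    # 2,000 rows already copied locally, 10,000 rows appended since then.
    table = make_fake_table([{"id": i} for i in range(12_000)])
    pages = list(list_rows(table, max_results=1000, start_index=2_000))
    print(len(pages))
```

With 10,000 new rows and a page size of 1000 the fake table yields 10 pages, matching the arithmetic in the question. Against the real service, the caveat from the answer still applies: resuming from a row index is only meaningful if no coalesce has reordered the data in the meantime.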

You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.
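As a minimal sketch of the completeness check, the function below compares a local row count with the row count reported by tables.get. It assumes `table_resource` is the parsed JSON body of a tables.get response and that the count is in its `numRows` field, which the REST representation encodes as a string; the function name and error handling are illustrative choices, not part of the API.

```python
def missing_row_count(table_resource: dict, local_row_count: int) -> int:
    """Return how many rows the local copy lacks relative to the remote table.

    Assumes table_resource is the JSON body of a tables.get response,
    where numRows is a string-encoded integer.
    """
    remote_rows = int(table_resource["numRows"])
    if local_row_count > remote_rows:
        # More local rows than remote rows suggests duplicates or a stale
        # assumption, so fail loudly rather than return a negative number.
        raise ValueError("local copy has more rows than the remote table")
    return remote_rows - local_row_count


if __name__ == "__main__":
    print(missing_row_count({"numRows": "1010000"}, 1_000_000))
```

A result of zero only shows that the counts match. Because the key column is unique but not sequential, a count comparison cannot say which rows are missing.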
