How to provide an API client with 1,000,000 database results?


Question

Following up on my previous question:

Using "Cursors" for paging in PostgreSQL

What is a good way to provide an API client with 1,000,000 database results?

We are currently using PostgreSQL. A few suggested methods:

  • Paging using cursors
  • Paging using random numbers (add "GREATER THAN ORDER BY" to each query)
  • Paging using LIMIT and OFFSET (breaks down for very large data sets)
  • Save the information to a file and let the client download it
  • Iterate through the results, then POST the data to the client
  • Return only keys to the client, then let the client request the objects from cloud files like Amazon S3 (may still require paging just to get the file names).

What haven't I thought of that is stupidly simple and way better than any of these options?

Answer

The table has a primary key. Make use of it.

Instead of LIMIT and OFFSET, do your paging with a filter on the primary key. You hinted at this with your comment:

Paging using random numbers (add "GREATER THAN ORDER BY" to each query)

but there's nothing random about how you should do it.

SELECT * FROM big_table WHERE id > $1 ORDER BY id ASC LIMIT $2

Allow the client to specify both parameters: the last ID it saw and the number of records to fetch. Your API will have to either have a placeholder, an extra parameter, or an alternate call for "fetch the first n IDs" that omits the WHERE clause from the query, but that's trivial.
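As a runnable sketch of this keyset-pagination idea (using Python's stdlib sqlite3 in place of PostgreSQL purely so the example is self-contained; the table, column, and function names here are illustrative, not from the original question):

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the keyset technique is identical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO big_table (id, payload) VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(1, 11)])

def fetch_page(last_id, page_size):
    """Return up to page_size rows with id greater than last_id, in id order.

    Passing last_id = 0 doubles as the "fetch the first n IDs" placeholder
    the answer mentions, so no separate no-WHERE variant is needed here.
    """
    cur = conn.execute(
        "SELECT id, payload FROM big_table WHERE id > ? ORDER BY id ASC LIMIT ?",
        (last_id, page_size))
    return cur.fetchall()

page1 = fetch_page(0, 4)             # rows with ids 1..4
page2 = fetch_page(page1[-1][0], 4)  # resume after the last id seen: ids 5..8
```

The client just threads the last id it received back into the next call; the server never has to count or skip rows.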

This approach will use a fairly efficient index scan to get the records in order, generally avoiding a sort or the need to iterate through all the skipped records. The client can decide how many rows it wants at once.

This approach differs from the LIMIT and OFFSET approach in one key way: concurrent modification. If you INSERT into the table with a key lower than a key some client has already seen, this approach will not change its results at all, whereas the OFFSET approach will repeat a row. Similarly, if you DELETE a row with a lower-than-already-seen ID, the results of this approach will not change, whereas OFFSET will skip an unseen row. There is no difference for append-only tables with generated keys, though.
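The concurrent-modification difference can be demonstrated directly (again with stdlib sqlite3 as a stand-in for PostgreSQL; the ids are arbitrary illustrative values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO big_table (id) VALUES (?)",
                 [(i,) for i in (2, 4, 6, 8)])

# Both strategies read the same first page of two rows: ids 2 and 4.
offset_page1 = [r[0] for r in conn.execute(
    "SELECT id FROM big_table ORDER BY id LIMIT 2 OFFSET 0")]
keyset_page1 = [r[0] for r in conn.execute(
    "SELECT id FROM big_table WHERE id > 0 ORDER BY id LIMIT 2")]

# A concurrent writer now inserts id 1, below everything already seen.
conn.execute("INSERT INTO big_table (id) VALUES (1)")

# OFFSET counts rows from the start, so id 4 gets served a second time...
offset_page2 = [r[0] for r in conn.execute(
    "SELECT id FROM big_table ORDER BY id LIMIT 2 OFFSET 2")]
# ...while the keyset filter resumes after the last id seen (4).
keyset_page2 = [r[0] for r in conn.execute(
    "SELECT id FROM big_table WHERE id > 4 ORDER BY id LIMIT 2")]
```

The OFFSET client sees row 4 twice; the keyset client's second page is unaffected by the insert.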

If you know in advance that the client will want the whole result set, the most efficient thing to do is just send them the whole result set with none of this paging business. That's where I would use a cursor. Read the rows from the DB and send them to the client as fast as the client will accept them. This API would need to set limits on how slow the client was allowed to be to avoid excessive backend load; for a slow client I'd probably switch to paging (as described above) or spool the whole cursor result out to a temporary file and close the DB connection.
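A minimal sketch of this streaming pattern, holding one open cursor and draining it in batches (stdlib sqlite3 stands in for PostgreSQL here; the batch size and names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO big_table (id) VALUES (?)",
                 [(i,) for i in range(1, 101)])

def stream_all(batch_size=25):
    """Yield every row in batches, keeping one cursor open for the whole scan.

    Against PostgreSQL with psycopg2 you would use a named (server-side)
    cursor, e.g. conn.cursor(name="big_scan"), so each fetchmany() pulls the
    next batch from the server instead of materializing the full result
    client-side.
    """
    cur = conn.execute("SELECT id FROM big_table ORDER BY id")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield [r[0] for r in batch]

batches = list(stream_all())
```

Because the consumer pulls batches at its own pace, a slow client keeps the cursor (and connection) open longer, which is exactly why the answer suggests falling back to paging or spooling to a file for slow clients.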

Important caveats:

  • Requires a UNIQUE constraint / UNIQUE index or PRIMARY KEY to be reliable
  • Different concurrent modification behaviour to LIMIT/OFFSET, see above
