In PostgreSQL, are partitions or multiple databases more efficient?


Question

I have an application in which many companies post information. The data from each company is self-contained; there is no data overlap.

Performance-wise, is it better to:


  • keep the company ID on each row of each table and have each index use it?
  • partition each table according to the company ID?
  • partition and create a separate user for each company's access, to ensure security?
  • create multiple databases, one for each company?

It is a web-based application with persistent connections.

My thoughts:


  • new pg connections are expensive, so a single database creates fewer new connections
  • having only one copy of the dictionary seems more efficient than 200 or so
  • multiple databases are certainly safer from programmer error
  • if the application specs change so that companies share data, multiple databases would be difficult to implement

Recommended answer

I'd recommend searching the PostgreSQL mailing lists for information about multi-tenant design. There's been lots of discussion there, and the answer boils down to "it depends". Every approach trades off guaranteed isolation, performance, and maintainability against one another.

A common approach is to use a single database, but one schema (namespace) per customer with the same table structure in each schema, plus a shared or common schema for data that's the same across all of them. A PostgreSQL schema is like a MySQL "database" in that you can query across different schemas but they're isolated by default. With customer data in separate schemas you can use the search_path setting, usually via ALTER USER customername SET search_path = customerschema, sharedschema (note that the list is not wrapped in a single pair of quotes, which would make PostgreSQL treat it as one schema name), so that each customer sees their own data, and only their own data, by default.
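The per-customer setup described above can be scripted. The following is a minimal sketch, not a definitive provisioning tool: the helper names `quote_ident` and `tenant_schema_statements` are hypothetical, and it assumes the customer's role already exists and that the schema is named after the role. It only builds the SQL text; running the statements is left to whatever client you use.

```python
def quote_ident(name):
    """Quote a SQL identifier: wrap it in double quotes and double any embedded ones."""
    return '"' + name.replace('"', '""') + '"'


def tenant_schema_statements(customer, shared_schema="sharedschema"):
    """Return the statements that give one customer a schema and a default search_path.

    Assumption for illustration: the role and the schema share the customer's name,
    and the role has already been created.
    """
    role = quote_ident(customer)
    schema = quote_ident(customer)
    shared = quote_ident(shared_schema)
    return [
        f"CREATE SCHEMA {schema}",
        # search_path is a comma-separated list of identifiers, not one quoted string.
        f"ALTER ROLE {role} SET search_path = {schema}, {shared}",
    ]


if __name__ == "__main__":
    for statement in tenant_schema_statements("acme"):
        print(statement + ";")
```

Quoting each identifier separately keeps the two schema names distinct in the search_path list, which is the detail that is easy to get wrong when the whole list is quoted as one string.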

For additional protection, you should REVOKE ALL ON SCHEMA customerschema FROM public, then GRANT ALL ON SCHEMA customerschema TO thecustomer, so they're the only one with any access to it, and do the same for each of their tables. Your connection pool can then log in with a fixed user account that has no GRANTed access to any customer schema but has the right to SET ROLE to become any customer. (Do that by giving the pool account membership of each customer role with NOINHERIT set, so rights have to be explicitly claimed via SET ROLE.) The connection should immediately SET ROLE to the customer it's currently operating as. That lets you avoid the overhead of making new connections for each customer while maintaining strong protection against programmer error leading to access to the wrong customer's data. So long as the pool does a DISCARD ALL and/or a RESET ROLE before handing connections out to the next client, that's going to give you very strong isolation without the frustration of individual connections per user.
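As a rough illustration of that lockdown, the sketch below builds the statements for one customer. The function name `lockdown_statements` and the pool role name `webapp_pool` are assumptions made for the example, not names from the answer, and the table-level grants use the `ALL TABLES IN SCHEMA` form as one possible way to "do the same for each table".

```python
def quote_ident(name):
    """Quote a SQL identifier: wrap it in double quotes and double any embedded ones."""
    return '"' + name.replace('"', '""') + '"'


def lockdown_statements(schema, customer_role, pool_role="webapp_pool"):
    """Return statements restricting a schema to one customer role.

    pool_role is a hypothetical fixed account used by the connection pool. It gets
    membership of the customer role but, being NOINHERIT, must SET ROLE to use it.
    """
    s = quote_ident(schema)
    c = quote_ident(customer_role)
    p = quote_ident(pool_role)
    return [
        f"REVOKE ALL ON SCHEMA {s} FROM PUBLIC",
        f"GRANT ALL ON SCHEMA {s} TO {c}",
        f"REVOKE ALL ON ALL TABLES IN SCHEMA {s} FROM PUBLIC",
        f"GRANT ALL ON ALL TABLES IN SCHEMA {s} TO {c}",
        # Without inheritance, membership alone grants the pool account nothing.
        f"ALTER ROLE {p} NOINHERIT",
        f"GRANT {c} TO {p}",
    ]


if __name__ == "__main__":
    for statement in lockdown_statements("acme", "acme"):
        print(statement + ";")
```

Tables created later in the schema would need the same grants applied again, so a script like this is something to rerun (or extend) whenever the per-customer table set changes.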

If your web app environment doesn't have a decent connection pool built in (say, you're using PHP with persistent connections), then you really need to put a good connection pool in place between Pg and the web server anyway, because too many connections to the backend will hurt your performance. PgBouncer and PgPool-II are the best options, and can handily take care of doing the DISCARD ALL and RESET ROLE for you during connection hand-off.
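If the hand-off has to be done in application code instead, it might look something like the sketch below. This is an assumption-laden illustration: `acting_as` is a hypothetical helper, the connection is assumed to be a DB-API style object in autocommit mode (DISCARD ALL cannot run inside a transaction block), and `RecordingConnection` is only a stand-in so the example runs without a database. Because per-role settings are applied at login and not by SET ROLE, the sketch sets search_path explicitly.

```python
from contextlib import contextmanager


def quote_ident(name):
    """Quote a SQL identifier: wrap it in double quotes and double any embedded ones."""
    return '"' + name.replace('"', '""') + '"'


@contextmanager
def acting_as(conn, customer_role, shared_schema="sharedschema"):
    """Use a pooled connection as one customer, then scrub it before it is reused."""
    role = quote_ident(customer_role)
    cur = conn.cursor()
    try:
        cur.execute(f"SET ROLE {role}")
        # SET ROLE does not apply the role's stored settings, so set search_path here.
        cur.execute(f"SET search_path = {role}, {quote_ident(shared_schema)}")
        yield cur
    finally:
        # Always drop the customer identity and session state, even after an error.
        cur.execute("RESET ROLE")
        cur.execute("DISCARD ALL")
        cur.close()


class RecordingCursor:
    """Stand-in cursor that records statements instead of sending them anywhere."""

    def __init__(self, log):
        self.log = log

    def execute(self, sql):
        self.log.append(sql)

    def close(self):
        pass


class RecordingConnection:
    """Stand-in connection used only to demonstrate the statement order."""

    def __init__(self):
        self.log = []

    def cursor(self):
        return RecordingCursor(self.log)


if __name__ == "__main__":
    demo = RecordingConnection()
    with acting_as(demo, "acme") as cur:
        cur.execute("SELECT count(*) FROM orders")  # hypothetical customer table
    for statement in demo.log:
        print(statement)
```

Putting the reset in a `finally` clause means a failed request cannot leave a connection in the pool still holding the previous customer's role.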

The main downside of this approach is the overhead of maintaining that many tables, since your base set of non-shared tables is cloned for each customer. It'll add up as customer numbers grow, to the point where the sheer number of tables to examine during autovacuum runs starts to get expensive, and where any operation that scales with the total number of tables in the DB slows down. This is more of an issue if you're thinking of having many thousands or tens of thousands of customers in the same DB, but I strongly recommend you do some scaling tests with this design using dummy data before committing to it.
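One way to set up such a scaling test is to generate the schema-per-customer layout with throwaway names and then time the operations you care about. The sketch below is only a starting point: `dummy_tenant_ddl`, the `tenant_NNNNN` naming, and the two-column table shape are all made up for illustration, and real tests should clone your actual table definitions and load representative data.

```python
def dummy_tenant_ddl(tenant_count, tables_per_tenant):
    """Yield DDL that creates tenant_count schemas with tables_per_tenant tables each."""
    for t in range(tenant_count):
        schema = f"tenant_{t:05d}"  # generated names are plain lowercase, so no quoting needed
        yield f"CREATE SCHEMA {schema}"
        for n in range(tables_per_tenant):
            yield (
                f"CREATE TABLE {schema}.t_{n:03d} "
                "(id bigserial PRIMARY KEY, payload text)"
            )


# Catalog query to confirm how many ordinary tables the database now holds.
TABLE_COUNT_SQL = "SELECT count(*) FROM pg_class WHERE relkind = 'r'"


if __name__ == "__main__":
    # Print a small sample; pipe the full output into psql for a real run.
    for statement in dummy_tenant_ddl(tenant_count=2, tables_per_tenant=3):
        print(statement + ";")
    print(TABLE_COUNT_SQL + ";")
```

Running it at a few sizes (for example 200, 2,000, and 20,000 tenants) and comparing autovacuum activity and catalog-heavy operations at each size gives a feel for where the table count starts to hurt.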

The ideal approach is likely to be single tables with automatic row-level security controlling tuple visibility, but unfortunately that's something PostgreSQL doesn't have yet. It looks like it's on the way thanks to the SEPostgreSQL work adding suitable infrastructure and APIs, but it's not in 9.1.
