Best data store for billions of rows


Problem description

I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).

The only requirements are fast inserts and fast lookups for all records with the same GUID, and the ability to access the data store from .net.

I'm a SQL Server guy and I think SQL Server can do this, but with all the talk about BigTable, CouchDB, and other nosql solutions, it's sounding more and more like an alternative to a traditional RDBMS may be best, due to optimizations for distributed queries and scaling. I tried Cassandra, but the .net libraries don't currently compile or are all subject to change (along with Cassandra itself).

I've looked into many of the nosql data stores available, but can't find one that meets my needs as a robust, production-ready platform.

If you had to store 36 billion small, flat records so that they're accessible from .net, what would you choose and why?

Solution

Storing ~3.5TB of data and inserting about 1K/sec 24x7, while also querying at an unspecified rate: it is possible with SQL Server, but there are more questions:

• What availability requirement do you have for this? 99.999% uptime, or is 95% enough?
• What reliability requirement do you have? Does missing an insert cost you $1M?
• What recoverability requirement do you have? If you lose one day of data, does it matter?
• What consistency requirement do you have? Does a write need to be guaranteed to be visible on the next read?

If you need all of these requirements I highlighted, the load you propose is going to cost millions in hardware and licensing on a relational system, any system, no matter what gimmicks you try (sharding, partitioning etc). A nosql system would, by its very definition, not meet all of these requirements.

So obviously you have already relaxed some of these requirements. There is a nice chart comparing the nosql offerings based on the 'pick 2 out of 3' paradigm in the Visual Guide to NoSQL Systems.

After OP comment update

With SQL Server this would be a straightforward implementation:

• One single table with a clustered (GUID, time) key. Yes, it is going to get fragmented, but fragmentation only affects read-aheads, and read-aheads are needed only for significant range scans. Since you only query for a specific GUID and date range, fragmentation won't matter much. Yes, it is a wide key, so non-leaf pages will have poor key density. Yes, it will lead to a poor fill factor. And yes, page splits may occur. Despite these problems, given the requirements, it is still the best clustered key choice (see the T-SQL sketch after this list).
• Partition the table by time, so you can implement efficient deletion of the expired records via an automatic sliding window. Augment this with an online index partition rebuild of the last month, to eliminate the poor fill factor and fragmentation introduced by the GUID clustering.
• Enable page compression. Since the clustered key groups by GUID first, all records of a GUID will be next to each other, giving page compression a good chance to deploy dictionary compression.
• You'll need a fast IO path for the log file. You're interested in high throughput, not low latency, for the log to keep up with 1K inserts/sec, so striping is a must.
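To make the first three bullets concrete, here is a minimal T-SQL sketch. All object names, the payload column, and the monthly boundary dates are hypothetical; a production deployment would map partitions to separate filegroups rather than PRIMARY.

```sql
-- Monthly partition function and scheme (everything on PRIMARY for
-- brevity; production would spread partitions across filegroups).
CREATE PARTITION FUNCTION pfMonthly (datetime2(0))
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- The single table: small flat records, partitioned by time.
CREATE TABLE dbo.Records
(
    RecordGuid uniqueidentifier NOT NULL,
    RecordTime datetime2(0)     NOT NULL,
    Payload    binary(64)       NOT NULL  -- the ~50-75 byte record body
) ON psMonthly (RecordTime);

-- Non-unique clustered index: GUID first, so all rows for one GUID sit
-- next to each other (fast lookups, and good raw material for dictionary
-- page compression); time second, aligned with the partitioning column.
CREATE CLUSTERED INDEX CIX_Records
    ON dbo.Records (RecordGuid, RecordTime)
    WITH (DATA_COMPRESSION = PAGE)
    ON psMonthly (RecordTime);
```

The fast lookup the question asks for is then a plain clustered-index seek:

```sql
-- @guid, @from, @to are query parameters.
SELECT RecordGuid, RecordTime, Payload
FROM dbo.Records
WHERE RecordGuid = @guid
  AND RecordTime >= @from AND RecordTime < @to;
```

And the sliding window from the second bullet amounts to switching the oldest partition out and dropping it; the staging table must match the schema, index and filegroup of the partition being switched, and the partition numbers below are illustrative:

```sql
-- Retire the oldest month (partition 1) and remove its boundary.
-- dbo.RecordsStage is assumed to exist with a matching structure.
ALTER TABLE dbo.Records SWITCH PARTITION 1 TO dbo.RecordsStage;
DROP TABLE dbo.RecordsStage;
ALTER PARTITION FUNCTION pfMonthly() MERGE RANGE ('2010-01-01');

-- Repair the fill factor and fragmentation of last month's partition.
-- (Single-partition ONLINE rebuilds require SQL Server 2014 or later;
-- on earlier versions a partition-level rebuild is offline.)
ALTER INDEX CIX_Records ON dbo.Records
    REBUILD PARTITION = 3
    WITH (ONLINE = ON);
```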

Partitioning and page compression each require Enterprise Edition SQL Server; they will not work on Standard Edition, and both are quite important to meet the requirements.
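If in doubt, the edition can be checked with the built-in SERVERPROPERTY function:

```sql
-- EngineEdition 3 = Enterprise (also Developer/Evaluation), which is
-- the engine that supports partitioning and page compression.
SELECT SERVERPROPERTY('Edition')       AS Edition,
       SERVERPROPERTY('EngineEdition') AS EngineEdition;
```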

As a side note, if the records come from a farm of front-end web servers, I would put Express on each web server, and instead of INSERTing on the back end, I would SEND the info to the back end, using a local connection/transaction on the Express instance co-located with the web server. This gives a much, much better availability story to the solution.
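The SEND here refers to Service Broker. A minimal sketch of the web-server side, assuming the message type, contract, queues, services and routes between the local Express instance and the back end are already set up (all the //Records/... names are hypothetical):

```sql
DECLARE @handle  uniqueidentifier;
DECLARE @payload varbinary(max) = 0x00;  -- the ~50-75 byte record, serialized

-- Open a dialog from the local capture service to the back-end service.
BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//Records/CaptureService]
    TO SERVICE '//Records/IngestService'
    ON CONTRACT [//Records/IngestContract]
    WITH ENCRYPTION = OFF;

-- SEND commits in the local transaction; Service Broker delivers the
-- message to the back end asynchronously, so the web server keeps
-- accepting records even if the back end is temporarily down.
SEND ON CONVERSATION @handle
    MESSAGE TYPE [//Records/Record]
    (@payload);
```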

So this is how I would do it in SQL Server. The good news is that the problems you'll face are well understood and the solutions are known. That doesn't necessarily mean this is better than what you could achieve with Cassandra, BigTable or Dynamo. I'll let someone more knowledgeable in things no-sql-ish argue their case.

Note that I never mentioned the programming model, .Net support and such. I honestly think they're irrelevant in large deployments. They make a huge difference in the development process, but once deployed it doesn't matter how fast the development was, if the ORM overhead kills performance :)
