Quickest sparse matrix access, when disk is involved

Problem description

Imagine you have a table 'users' with 10 million records and a table 'groups' with 1 million records. On average you have 50 users per group, which, at least in an RDBMS, I would store in a table called users2groups. users2groups is in fact a sparse matrix. Only 80% of the full dataset of users and groups fits into available memory. The data for the group memberships (users2groups) comes on top, so that if memory is needed to cache group memberships, it has to be deallocated from either the users table or the groups table, or both.

I want to be able to:

  • quickly find a user by name, AND
  • quickly find a group by name, AND
  • quickly fetch all users in a group, AND
  • quickly fetch all groups a user belongs to
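
A minimal relational sketch of this setup, assuming MySQL-style DDL (table and column names are illustrative, not given in the question):

    -- Normalized layout; the name indexes serve the by-name searches,
    -- the two composite indexes on users2groups serve the
    -- "all groups of a user" and "all users of a group" lookups.
    CREATE TABLE users (
      id   INT UNSIGNED PRIMARY KEY,
      name VARCHAR(64) NOT NULL,
      KEY idx_users_name (name)
    );
    CREATE TABLE `groups` (
      id   INT UNSIGNED PRIMARY KEY,
      name VARCHAR(64) NOT NULL,
      KEY idx_groups_name (name)
    );
    CREATE TABLE users2groups (
      user_id  INT UNSIGNED NOT NULL,
      group_id INT UNSIGNED NOT NULL,
      PRIMARY KEY (user_id, group_id),        -- groups of a user
      KEY idx_group_user (group_id, user_id)  -- users of a group
    );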

From experience I know that disk latencies determine your access speed to a large extent. You can also balance between read and write speed. However, before one can do any of this, one has to decide on a database type these days... such as:

  • Relational DBMS
  • Key-value stores
  • Document stores
  • RDF stores
  • Object oriented DBMS
  • Graph DBMS
  • Search engines
  • Multivalue DBMS
  • Native XML DBMS
  • Wide column stores
  • Content stores
  • Navigational DBMS
  • (compressed) Bitmaps
  • files
  • more...?

So the question is: which of all these systems, or which combination of them, gives the best overall read access performance (with an acceptable write access performance) when RAM capacity is lower than the available data, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology?

Since this is a conceptual question, disk space and CPU capacity are out of scope or considered to be available "indefinitely".

Btw, I am aware that searching for names such as user names or group names can be sped up efficiently by using indexes based on hashes (e.g. crc32(lower(STRING))); an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)). However, the hashes and their indexes are not yet included in the memory budget when I say the users and groups tables have 80% RAM coverage. That is because I am unclear whether there is a better integrated solution. At the moment I just assume that breaking the whole topic up into the three pieces users, groups and users2groups is most sensible. I am lacking proof here.
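
A concrete version of that trick, assuming MySQL (the column and index names are illustrative, not from the question):

    -- Narrow hash column with its own index; the additional comparison
    -- on name filters out the rare CRC32 collisions.
    ALTER TABLE users
      ADD COLUMN name_hash INT UNSIGNED NOT NULL,
      ADD KEY idx_users_name_hash (name_hash);

    UPDATE users SET name_hash = CRC32(LOWER(name));

    SELECT id, name
    FROM users
    WHERE name_hash = CRC32(LOWER('some user name'))
      AND name = 'some user name';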

-------------------- UPDATE -------------------------------------

I understand that there are competing concepts on the table:

  • Denormalize quite far, so that queries against disk can be reduced (e.g. mongodb etc.)
  • Squeeze the data so that most of it fits into memory anyway (e.g. compressed bitmaps)

As denormalization means 'blowing up data volumes', these two concepts seem to contradict each other. Are there best practices, or scientific or otherwise sensible arguments, for when to use denormalization and when to use the squeeze-data approach? E.g. a KPI saying: if less than 80% fits into memory, go for denormalization, or so?
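
A back-of-envelope sketch of how such a threshold could be reasoned about, using the numbers from the question and assuming 4-byte integer IDs: 1 million groups with ~50 users each gives roughly 50 million rows in users2groups, so the raw key pairs alone take about 50M × 8 bytes ≈ 400 MB. Adding per-row storage overhead and the composite indexes needed for both lookup directions, a total in the low gigabytes is the more realistic footprint (the exact factor depends on the storage engine). Comparing that figure against the RAM left over after caching the hot parts of users and groups is probably a more useful decision rule than a fixed percentage such as 80%.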

-------------------- UPDATE -------------------------------------

Extra memory costs extra money, while most DB servers usually have lots of empty disk space getting bored on their feet. So denormalization is cheap. On the other hand, denormalization opportunities are limited: disk latency physically limits the maximum number of queries per second, full stop. So too many queries against disk get queued, which constrains denormalization to an extent for applications with lots of traffic. Even denormalized data access speed depends on memory to a large extent.

So maybe KPIs are not possible here. Anyway, for the given sparse matrix example, how do denormalization and the squeeze-data approach need to be balanced? I thought about compressing the users and groups tables, leaving them in an RDBMS, and then assigning the freed memory to the cache of a document DB which serves the users2groups relations. However, this introduces a whole set of new issues, such as multiple round trips to deal with two database systems and more complicated query management. So how to tackle this?
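
If the "compress users and groups inside the RDBMS" part is taken on its own, a minimal sketch of what that could look like, assuming MySQL/InnoDB (the KEY_BLOCK_SIZE value is an assumption to be tuned, not a recommendation):

    -- Store the two big tables in compressed pages, so that under memory
    -- pressure more rows fit into the same buffer pool.
    ALTER TABLE users    ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
    ALTER TABLE `groups` ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;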

-------------------- UPDATE -------------------------------------

As per the suggestion of lagivan, the sparse matrix with only flagging relations seems to be solved in a sensible way: have two tables, users and groups, and then have in the users table a multiple-ID field holding the IDs of the related groups, and vice versa in the groups table a multiple-ID field holding the IDs of the related users. I guess this solution is not tightly coupled to a specific technology. It could even be implemented in MySQL via VARBINARY or any blob type.
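
A minimal sketch of that idea in MySQL (column names, sizes and the blob type are assumptions; packing and unpacking the fixed-width IDs happens in the application):

    -- Each row carries its related IDs as a packed array of fixed-width
    -- integers, e.g. 4 bytes per ID, instead of a separate relation table.
    CREATE TABLE users (
      id        INT UNSIGNED PRIMARY KEY,
      name      VARCHAR(64) NOT NULL,
      group_ids MEDIUMBLOB
    );
    CREATE TABLE `groups` (
      id       INT UNSIGNED PRIMARY KEY,
      name     VARCHAR(64) NOT NULL,
      user_ids MEDIUMBLOB
    );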

The outstanding part of the question is related to sparse matrices that contain some 'juice information' like status or lastupdatedate. Using foreign-key arrays would, by concept, sort of disable that information. So the original question for such a scenario is still open: which of all these systems, or which combination of them, gives the best overall read access performance (with an acceptable write access performance) when RAM capacity is lower than the available data, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology?

Answer

This addresses the part of the question that is related to sparse matrices containing some 'juice information' like status or lastupdatedate.

I tried to generalize the first part of the answer: in fact I did not find a real reason why one should move away from an RDBMS to any other technology to solve sparse matrices better. So let's consider only RDBMS (where denormalized data can be stored in VARBINARY columns or blobs). I am a big fan of normalization. However, what I have learned now is: denormalize if denormalization results in lower memory consumption, considering data AND index data. Normalization rules aim at optimizing the data's memory consumption without taking into account that there are scenarios, such as sparse matrices with indexed foreign-key pairs, that can easily confuse the benefits and efforts of normalization. I also sort of (re-)learned that squeezing the data into memory as well as possible is THE key to performance (lagivan also argued for the performance leverage of caching).
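
For the normalized variant with 'juice information', a sketch of what "data AND index data" means in practice, assuming MySQL (names are illustrative):

    -- The relation table carries the extra attributes; the primary key
    -- serves the user->groups direction, the secondary index the
    -- group->users direction. Both indexes count towards the footprint
    -- that has to be held in memory.
    CREATE TABLE users2groups (
      user_id        INT UNSIGNED NOT NULL,
      group_id       INT UNSIGNED NOT NULL,
      status         TINYINT UNSIGNED NOT NULL DEFAULT 0,
      lastupdatedate DATETIME NOT NULL,
      PRIMARY KEY (user_id, group_id),
      KEY idx_group_user (group_id, user_id)
    );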

Having said that, there are various options for the second part of the answer:

the ones from lagivan's update, plus:

  • Have a dedicated summary/tracking/audit table per 'juice information' use case, i.e. denormalize into aggregated data over time.
  • Combine denormalization and normalization, where denormalization is used only for the plain relations (see the first part of the answer) and the normalized part with the 'juice information' covers only a very limited time range (e.g. three months).
  • Write the 'juice information' to some log file and aggregate it into reports only occasionally (e.g. once a day); that is what data warehouses are for.

The solution is now to calculate the memory consumption of each acceptable design, for index AND data, and then pick the option with the lowest consumption. Btw, the options come with different levels of implementation effort.
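
One way to get those numbers, assuming MySQL/InnoDB and a representative data sample loaded into each candidate schema (the schema name is a placeholder):

    -- Rough data vs. index footprint per table; compare the totals of the
    -- candidate designs against the available RAM.
    SELECT table_name,
           ROUND(data_length  / 1024 / 1024) AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
    FROM information_schema.tables
    WHERE table_schema = 'mydb'   -- placeholder schema name
    ORDER BY data_length + index_length DESC;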
