What is the best document storage strategy in NoSQL databases?


Problem description



NoSQL databases like Couchbase hold a lot of documents in memory, hence their enormous speed, but this also puts a greater demand on the memory size of the server(s) they run on.

I'm looking for the best strategy among several opposing approaches to storing documents in a NoSQL database. These are:

  • Optimise for speed

Putting all the information into one (big) document has the advantage that a single GET retrieves it, either from memory or from disk (if it was purged from memory earlier). With schema-less NoSQL databases this is almost the desired approach. But eventually the document becomes too big and eats up a lot of memory, so fewer documents can be kept in memory in total.

  • Optimise for memory

Splitting each document up into several smaller documents (e.g. using compound keys, as described in this question: Designing record keys for document-oriented database - best practice), especially when each of those documents only holds the information needed for a specific read/update operation, would allow more (transient) documents to be held in memory.
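
To make the trade-off concrete, here is a minimal sketch in plain Python (no database SDK; the subscriber data, field names and key format are made-up examples) contrasting the two layouts: one big document that a single GET returns in full, versus several compound-keyed fragments of which only the hot one needs to sit in memory.

```python
# One big document per subscriber: a single GET returns everything,
# but the whole blob occupies cache space even if only one field is hot.
big_doc_key = "6281234567890"  # MSISDN used directly as the key
big_doc = {
    "profile": {"name": "A. Subscriber", "age": 31, "gender": "f"},
    "revenue": {"calls": 1523, "sms": 8770, "total_revenue": 412.50},
    "optin":   [{"in": "2014-06-01", "out": "2014-08-15"}],
}

# Split layout: compound keys, one small document per concern.
# Only the fragment a given operation touches needs to be in memory.
split_docs = {
    "6281234567890:profile": big_doc["profile"],
    "6281234567890:revenue": big_doc["revenue"],
    "6281234567890:optin":   big_doc["optin"],
}
```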

The use case I'm looking at is Call Detail Records (CDRs) from telecommunication providers. These CDRs typically run into hundreds of millions per day. Yet many of these customers don't produce a single record on any given day (I'm looking at the South-East Asian market with its prepaid dominance and still lower data saturation). That means that typically a large number of documents get a read/update maybe every other day, and only a small percentage go through several read/update cycles per day.

One solution that was suggested to me is to build two buckets, with more RAM allocated to the bucket holding the more transient documents and less RAM to the second bucket holding the bigger documents. That would allow faster access to the more transient data and slower access to the bigger documents which, for example, hold profile/user information that hardly changes at all. I do see two downsides to this proposal, though: one is that you can't build a view (Map/Reduce) across two buckets (this is specific to Couchbase; other NoSQL solutions might allow it), and the second is the extra overhead of closely managing the balance between the memory allocations of both buckets as the user base grows.

Has anyone else been challenged by this, and what was your solution to the problem? What would be the best strategy from your POV and why? Clearly it must be something in between the two strategies; having only one document, or having one big document split up into hundreds of documents, can't be the ideal solution IMO.

EDIT 2014-09-14: OK, this comes close to answering my own question, but in the absence of any offered solution so far, and following a comment, here is a bit more background on how I now plan to organise my data, trying to hit a sweet spot between speed and memory consumption:

Mobile_No:Profile

  • this holds profile information from a table, not directly from a CDR. Less transient data goes in here, like age, gender and name. The key is a compound key consisting of the mobile number (MSISDN) and the word "profile", separated by a ":".

Mobile_No:Revenue

  • this holds transient information like usage counters and variables accumulating the total revenue the customer spent. The key is again a compound key consisting of the mobile number (MSISDN) and the word "revenue", separated by a ":".

Mobile_No:Optin

  • this holds semi-transient information about when a customer opted into the program and when he/she opted out of it again. This can happen several times and is handled via an array. The key is again a compound key consisting of the mobile number (MSISDN) and the word "optin", separated by a ":".

Connection_Id

  • this holds information about a specific A/B connection (sender/receiver) made via voice or video call or SMS/MMS. The key consists of both mobile_no's concatenated.

Before these changes to the document structure I was putting all the profile, revenue and optin information into one big document, always keeping the connection_id as a separate document. This new document storage strategy hopefully gives me a better compromise between speed and memory consumption, as I split the main document into several documents so that each of them holds only the information that is read/updated in a single step of the app.

This also takes care of the different rates of change over time, with some data being very transient (like the counters and the accumulated revenue field that get updated with every incoming CDR) and the profile information being mostly unchanged. I do hope this gives a better understanding of what I'm trying to achieve; comments and feedback are more than welcome.
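
As a rough illustration of the per-CDR write path this split enables, here is a small Python sketch. The field names, the charge/type attributes and the plain dict standing in for the Couchbase bucket are all assumptions; with the real SDK the revenue increments would normally be atomic counter or sub-document operations rather than a read-modify-write on a dict.

```python
# Each incoming CDR touches only the small, hot fragments: the A-party's
# revenue counters and one new A/B connection document. The profile and
# optin documents are left alone, so they can stay cold.
def keys_for_cdr(cdr):
    msisdn_a, msisdn_b = cdr["a_number"], cdr["b_number"]
    return {
        "revenue": msisdn_a + ":revenue",         # counters, updated per CDR
        "connection": msisdn_a + ":" + msisdn_b,  # A/B record (separator added for readability)
    }

def apply_cdr(store, cdr):
    keys = keys_for_cdr(cdr)
    # A plain dict stands in for the bucket here; in Couchbase these
    # increments would typically be atomic counter operations.
    revenue = store.setdefault(keys["revenue"], {"events": 0, "total_revenue": 0.0})
    revenue["events"] += 1
    revenue["total_revenue"] += cdr["charge"]
    store[keys["connection"]] = {"type": cdr["type"], "ts": cdr["ts"], "charge": cdr["charge"]}

store = {}
apply_cdr(store, {"a_number": "6281234567890", "b_number": "6289876543210",
                  "type": "voice", "ts": "2014-09-14T10:02:11", "charge": 0.35})
```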

Solution

Thank you for updating your original question. You are correct when you talk about finding the right balance between coarse-grained and fine-grained documents.

The final architecture of the documents actually depends on your particular business domain needs. You have to identify in your use cases the "chunks" of data that are needed as a whole and then base the shape of your stored documents on this. Here are some high-level steps you need to perform when you design your document structure:

  1. Identify all document consumption use cases for your app/service (read, read-write, searchable items).
  2. Design your documents (most likely you will end up with several smaller documents rather than one big doc that has everything).
  3. Design document keys that can coexist in one bucket for different document types (e.g. use a namespace in the key value).
  4. Do a "dry run" of the resulting model against your use cases to check that you have optimal read/write transactions to NoSQL and all required document data within each transaction (see the sketch after this list).
  5. Run performance testing for your use cases (try to simulate at least twice the expected load).
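
As referenced in step 4, one low-tech way to do the dry run is simply to tabulate, per use case, which documents are read and written. The use cases and document names below are illustrative guesses based on the key scheme from the question, not a prescribed list.

```python
# Hypothetical "dry run": list each use case next to the documents it
# reads/writes, to check that no step needs more than a couple of GETs
# and that hot paths never touch the big, rarely-changing fragments.
use_cases = {
    "process_incoming_cdr":  {"reads": ["<msisdn>:revenue"],
                              "writes": ["<msisdn>:revenue", "<a><b> connection"]},
    "show_customer_profile": {"reads": ["<msisdn>:profile", "<msisdn>:optin"],
                              "writes": []},
    "opt_in_or_out":         {"reads": ["<msisdn>:optin"],
                              "writes": ["<msisdn>:optin"]},
}

for name, docs in use_cases.items():
    touched = set(docs["reads"]) | set(docs["writes"])
    print(f"{name}: {len(touched)} document type(s) touched -> {sorted(touched)}")
```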

Note: when you design the different docs it's OK to have some sort of redundancy (remember, it's not an RDBMS with normalized form); think of it more as object-oriented design.

Note 2: if you have searchable items outside of your keys (e.g. searching customers by last name "starts with" and other dynamic search criteria), consider using the ElasticSearch integration with CB, or you can also try the N1QL query language that is coming with CB 3.0.
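
For the "searchable items" case, an N1QL statement would look roughly like the sketch below. The bucket name "cdr", the "type" discriminator field and the "last_name" field are assumptions for illustration, and the exact N1QL syntax and the way you execute it depend on the Couchbase and SDK versions.

```python
# Illustrative N1QL statement for a "last name starts with" search.
n1ql_statement = """
SELECT d.name, d.age
FROM cdr d
WHERE d.type = 'profile'
  AND d.last_name LIKE 'Sm%'
"""
# The statement would be run through the SDK's N1QL query API or the
# cbq shell; how that call looks differs between SDK versions.
```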

It seems that you are going in the right direction by splitting into several smaller documents all linked by an MSISDN, e.g. MSISDN:profile, MSISDN:revenue, MSISDN:optin. I would pay special attention to your last document type, the "A/B" connection. That sounds like it might generate a large volume and is transient in nature... so you have to find out how long these documents have to live in the Couchbase bucket. You can specify a TTL (time to live) so that old docs are auto-cleared.
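
A sketch of the TTL idea for the connection documents, assuming a Couchbase Python SDK 2.x-style bucket object whose upsert accepts a ttl argument in seconds (newer SDKs express this as an "expiry" option); the 90-day retention period is an arbitrary placeholder, not a recommendation.

```python
from datetime import timedelta

# Give the high-volume, transient A/B connection documents a bounded
# lifetime instead of keeping them forever.
CONNECTION_TTL_SECONDS = int(timedelta(days=90).total_seconds())

def store_connection(bucket, msisdn_a, msisdn_b, cdr):
    # "bucket" is assumed to be a Couchbase Python SDK 2.x-style Bucket;
    # newer SDK versions take an "expiry" option on upsert instead of "ttl".
    key = msisdn_a + ":" + msisdn_b
    doc = {"type": cdr["type"], "ts": cdr["ts"], "charge": cdr["charge"]}
    bucket.upsert(key, doc, ttl=CONNECTION_TTL_SECONDS)
```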
