What is fastest structure to read in MongoDB: multiple documents or subdocuments?


Question

Introduction

I use Mongo to store moderately long financial timeseries, which I can read in 2 ways:

  • retrieve 1 series for its entire length

  • retrieve N series on a specific date

To facilitate the second type of query, I slice the series by year. This reduces the data load when querying for a large number of series on a specific day (example: if I query the value of 1000 timeseries on a specific day, it is not feasible to query back the entire history of each, which can go back 40 years = 28k entries each).

Problem

Writes are not time-sensitive. Storage space is plentiful. Reads are time-sensitive. What is the best option to archive the data for fast reads of both the first and the second kind?

Option A - separate documents

{ _id: xxx, stock: "IBM", year: 2014, prices: [ <daily prices for 2014> ] }
{ _id: xxx, stock: "IBM", year: 2015, prices: [ <daily prices for 2015> ] }

In option A, I would find() with a compound index on year and stock.
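As a hedged sketch (the collection name and ticker list are assumptions), the two read types under option A could look like this; note that the stock-only read is not served by the {year, stock} index and may want its own index:

// index for read type 2 (many stocks, one year slice)
db.series.createIndex({ year: 1, stock: 1 })
db.series.find({ year: 2014, stock: { $in: tickers } })

// read type 1 (full history of one stock, one document per year)
db.series.find({ stock: "IBM" }).sort({ year: 1 })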

Option B - subdocuments

{
  _id: xxx,
  stock: "IBM",
  2014: [ <daily prices for 2014> ],
  2015: [ <daily prices for 2015> ]
}

In option B, I would find() on a simple index on stock, and add a projection to return only the year I am looking for.
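A minimal sketch of that access pattern (the collection name is assumed):

// one document per stock; project only the year of interest
db.series.createIndex({ stock: 1 })
db.series.find({ stock: "IBM" }, { "2014": 1, _id: 0 })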

Option B.1 - subdocuments with zipped content

Same as above, but the <daily prices for 201x> arrays are compressed by JSON-encoding and then zlib-compressing them.
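A minimal Node.js sketch of that compression step, assuming the official mongodb driver and a series collection (both assumptions); note that the compressed blob can no longer be projected per day and must be inflated and parsed on every read:

const zlib = require("zlib");

const dailyPrices2014 = [101.2, 101.5 /* ... made-up values ... */];

// JSON-encode, then zlib-deflate; the resulting Buffer is stored as BSON binary
const compressed = zlib.deflateSync(JSON.stringify(dailyPrices2014));

// assumed to run inside an async function, with `collection` bound to
// the series collection via the official mongodb driver
await collection.updateOne(
  { stock: "IBM" },
  { $set: { "2014": compressed } },
  { upsert: true }
);

// on read: JSON.parse(zlib.inflateSync(doc["2014"].buffer).toString())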

Option C - subdocuments with daily data

{
  _id: xxx,
  stock: "IBM",
  0: <price for day 0 of 2014>,
  1: <price for day 1 of 2014>,
  ...
  n: <price for day n of 2015>  // n can be as large as 10,000
}

Option D - nested subdocuments

{
  _id: xxx,
  stock: "IBM",
  2014: {
    0: <price for day 0>,
    1: <price for day 1>,
    ...
  },
  2015: {
    0: <price for day 0>,
    1: <price for day 1>,
    ...
  }
}

I would then have to apply a query approach like this. Note that option D might double the data required to do a read of the first type described above.
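For concreteness, a hedged sketch of such a query (the collection name and day index are assumptions): dot notation reaches into the nested year/day subdocuments.

db.series.find(
  { stock: "IBM" },
  { "2014.17": 1, _id: 0 }  // price for day 17 of 2014
)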

Answer

Hm, I think I can improve your model by making it simpler:

{
  _id: new ObjectId(),
  key: "IBM",
  date: someISODate,
  price: somePrice,
  exchange: "NASDAQ"
}
db.stocks.createIndex({ key: 1, date: 1, exchange: 1 })

In this model, you have all the information you need:

db.stocks.find({
  key: "IBM", 
  date: { 
    $gte: new ISODate("2014-01-01T00:00:00Z"),
    $lt: new ISODate("2015-01-01T00:00:00Z")
  }
})

For example, if you wanted to know the average price of the IBM stock in May 2014, you'd use an aggregation:

db.stocks.aggregate([
  { $match: {
      key: "IBM",
      date: {
        $gte: new ISODate("2014-05-01T00:00:00Z"),
        $lt: new ISODate("2014-06-01T00:00:00Z")
      }
    }
  },
  { $group: {
      _id: {
        stock: "$key",
        month: { $month: "$date" },
        year: { $year: "$date" }
      },
      avgPrice: { $avg: "$price" }
    }
  }
])

This would result in a returned document like:

{
  _id: {
    stock: "IBM",
    year: 2014,
    month: 5
  },
  avgPrice: 8000.42
}

You could even precalculate the averages for every stock and every month rather easily:

db.stocks.aggregate([
  {
    $group: {
      _id: {
        stock: "$key",
        month: { $month: "$date" },
        year: { $year: "$date" }
      },
      averagePrice: { $avg: "$price" }
    }
  },
  { $out: "avgPerMonth" }
])

Finding the average for IBM in May 2014 now becomes a simple query:

db.avgPerMonth.find({
  "_id": {
    "stock": "IBM",
    "month": 5,
    "year": 2014
  }
})

And so on. You really want to use aggregations with stocks. For example: "In which month of the year was the IBM stock most expensive historically?"
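A hedged sketch of that very question against the avgPerMonth collection produced above: sort the monthly averages for IBM in descending order and keep the top result.

db.avgPerMonth.aggregate([
  { $match: { "_id.stock": "IBM" } },
  { $sort: { averagePrice: -1 } },
  { $limit: 1 }
])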

Nice and easy, with optimum performance for both reads and writes. Also, you save yourself multiple $unwind statements in your aggregation queries (which would not be easy with arbitrary keys anyway).
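For contrast, a sketch of what the array-based models from the question would force on you (collection and field names assumed): the per-year prices array has to be $unwind-ed before any per-day aggregation can run.

db.series.aggregate([
  { $match: { stock: "IBM", year: 2014 } },
  { $unwind: "$prices" },
  { $group: { _id: "$stock", avgPrice: { $avg: "$prices" } } }
])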

Granted, we have the redundancy of the duplicate values for key, but we circumvent a few problems:

  1. BSON documents are limited to a size of 16MB, so your model would impose a theoretical limit on how long a series can grow (see the size-check sketch after this list).
  2. When using MongoDB's mmapv1 storage engine (the only one available before MongoDB 3.0, and the default in 3.0), growing a document can trigger a rather expensive document migration within the data files, since documents are guaranteed never to be fragmented.
  3. Complicated models lead to complicated code. Complicated code is harder to maintain. The harder code is to maintain, the longer it takes. The longer a task takes, the more expensive (money-wise and/or time-wise) code maintenance becomes. Conclusion: complicated models are more expensive than simple ones.
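Regarding point 1, a quick legacy mongo shell sketch (the collection name is assumed) for checking how close a year-sliced document gets to the cap:

// Object.bsonsize() reports the BSON size of a document in bytes;
// the hard limit is 16MB = 16777216 bytes
var doc = db.series.findOne({ stock: "IBM" });
print(Object.bsonsize(doc) + " of 16777216 bytes used");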

Edit

For the dates, you need to keep the different time zones in mind, and either normalize them to Zulu time or stay within the time zone of the exchange, so that date-based aggregations are precise.
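For illustration, a minimal sketch (the price value is made up) of storing a trading day normalized to Zulu time:

db.stocks.insertOne({
  key: "IBM",
  date: new ISODate("2014-05-02T00:00:00Z"),  // UTC midnight, not exchange-local
  price: 187.45,                              // made-up value
  exchange: "NYSE"
})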
