Is it possible to improve Mongoexport speed?


Problem description

I have a 130M rows MongoDB 3.6.2.0 collection. It has several simple fields and 2 fields with nested JSON documents. Data is stored in compressed format (zlib).

I need to export one of embedded fields into JSON format as soon as possible. However, mongoexport is taking forever. After 12 hours of running it has processed only 5.5% of data, which is too slow for me.

The CPU is not busy. Mongoexport seems to be single-threaded.

Export command I am using:

mongoexport -c places --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords.json

Under the hood, it is actually the getMore command that is unreasonably slow:

2018-05-02T17:59:35.605-0700 I COMMAND  [conn289] command maps.places command: getMore { getMore: 14338659261, collection: "places", $db: "maps" } originatingCommand: { find: "places", filter: {}, sort: {}, projection: { _id: 1, API: 1 }, skip: 0, snapshot: true, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:14338659261 keysExamined:0 docsExamined:5369 numYields:1337 nreturned:5369 reslen:16773797 locks:{ Global: { acquireCount: { r: 2676 } }, Database: { acquireCount: { r: 1338 } }, Collection: { acquireCount: { r: 1338 } } } protocol:op_query 22796ms

I have tried running multiple commands with --skip and --limit options in separate processes, like this:

mongoexport -c places --skip 10000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords1.json
mongoexport -c places --skip 20000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords2.json

and so on. But I gave up waiting before the command with the first non-zero --skip even started producing output!
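
This is expected to be slow, as far as I understand: explain() shows that a skipped, unindexed query still collection-scans and discards every preceding document before returning the first result (a sketch against the same collection and connection string as above):

    # Sketch: explain a skipped query. The COLLSCAN must examine all
    # 20,000,000 skipped documents before producing anything, which is why
    # the second process appears to never start.
    mongo "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin" \
      --eval 'printjson(db.places.find({}, { API: 1 }).skip(20000000).limit(10).explain("executionStats").executionStats)'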

I have also tried the --forceTableScan option, which did not make any difference.

I have no indexes on the places collection.

My storage configuration:

journal.enabled: false
wiredTiger.collectionConfig.blockCompressor: zlib

Collection stats:

'ns': 'maps.places',
'size': 2360965435671,
'count': 130084054,
'avgObjSize': 18149,
'storageSize': 585095348224.0

My server specs:

Windows Server 2012 R2 x64
10 GB RAM, 4 TB HDD, 6-core Xeon 2.2 GHz

I have run a test, and with an SSD the read throughput is just as terrible as with the HDD.

My questions:

Why is reading so slow? Has anyone else experienced the same issue? Can you give me any hints on how to speed up data dumping?

I moved the DB to fast NVMe SSD drives, and I think I can now state my concerns about MongoDB read performance more clearly.

Why does this command, which looks for a chunk of documents lacking a specific field:

2018-05-05T07:20:46.215+0000 I COMMAND  [conn704] command maps.places command: find { find: "places", filter: { HTML: { $exists: false }, API.url: { $exists: true } }, skip: 9990, limit: 1600, lsid: { id: UUID("ddb8b02c-6481-45b9-9f84-cbafa586dbbf") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:15881327065 keysExamined:0 docsExamined:482851 numYields:10857 nreturned:101 reslen:322532 locks:{ Global: { acquireCount: { r: 21716 } }, Database: { acquireCount: { r: 10858 } }, Collection: { acquireCount: { r: 10858 } } } protocol:op_query 177040ms

yield only 50 Mb/sec of read pressure on a fast flash drive? This is clearly the performance of single-threaded random (scattered) reads, whereas I have just proven that the drive easily sustains 1 Gb/sec of read/write throughput.

In terms of Mongo internals, would it not be wiser to read the BSON file sequentially and gain a 20x scanning speed improvement, instead of iterating over BSON documents one by one? (And, since my blocks are zlib-compressed and the server has 16 cores, wouldn't it be better to decode fetched chunks in one or several helper threads?)

I can also confirm that even when I specify no query filter and clearly want to iterate the ENTIRE collection, fast sequential reading of the BSON file is not happening.

Recommended answer

There are many factors limiting the export performance.

• The data size is relatively large compared to the available memory: ~2 TB vs. ~5 GB of WiredTiger cache (if set to default). That is:
  • The whole WiredTiger cache can only contain at best ~0.22% of the collection (see the arithmetic sketch after this list); in reality it is very likely much less than this, since the cache will also contain data from other collections and indexes.
  • This means that WiredTiger needs to fetch from disk very frequently, while evicting the current content of the cache. If the replica set is being actively used, this means evicting "dirty" data from the cache and persisting it to disk, which takes time.
  • Note that documents inside the WiredTiger cache are not compressed.
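
As a back-of-the-envelope check on the ~0.22% figure (my own arithmetic, assuming a default cache of roughly 5 GB against the collStats size quoted above):

    # Best-case fraction of the collection that fits in the WiredTiger cache:
    echo "scale=6; (5 * 1024^3) / 2360965435671 * 100" | bc
    # -> .227300, i.e. roughly 0.22% of the ~2.36 TB collection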

One possible improvement: if this is an operation that you do frequently, creating an index on the relevant fields and exporting using a covered query could improve performance, since the index would be smaller than the full documents.
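
A minimal sketch of that idea, assuming the API values are small enough to index (MongoDB 3.6 caps index keys at roughly 1 KB, so this may not be feasible for large embedded documents) and reusing the connection string from the question:

    # One-off: build an index on the exported field.
    mongo "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin" \
      --eval 'db.places.createIndex({ API: 1 })'

    # Verify the read can be covered: the filter and projection touch only the
    # indexed field and _id is excluded, so expect IXSCAN with totalDocsExamined: 0.
    mongo "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin" \
      --eval 'printjson(db.places.find({ API: { $gt: MinKey } }, { API: 1, _id: 0 }).explain("executionStats").executionStats)'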

Running mongoexport in parallel may be helpful in this case:

Following on from the additional information provided, I ran a test that seems to alleviate this issue somewhat.

It seems that running mongoexport in parallel, with each process handling a subset of the collection, can speed up the export.

To do this, divide the _id namespace according to the number of mongoexport processes you plan to run.

For example, with 200,000 documents ranging from _id:0 to _id:199,999 and 2 mongoexport processes:

      mongoexport -q '{"_id":{"$gte":0, "$lt":100000}}' -d test -c test > out1.json &
      mongoexport -q '{"_id":{"$gte":100000, "$lt":200000}}' -d test -c test > out2.json &
      

In the above example, the two mongoexport processes each handle one half of the collection.
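
A sketch that generalizes this to N workers, assuming numeric and roughly evenly distributed _id values as in the example above (with ObjectId _ids, the range boundaries would instead have to be sampled from the collection):

    # Split the _id range [0, TOTAL) into N slices and export them concurrently.
    TOTAL=200000
    N=4
    STEP=$((TOTAL / N))
    for i in $(seq 0 $((N - 1))); do
      LO=$((i * STEP))
      HI=$((LO + STEP))
      mongoexport -d test -c test \
        -q "{\"_id\":{\"\$gte\":$LO, \"\$lt\":$HI}}" \
        -o "out$i.json" &
    done
    wait   # block until every background export has finished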

Testing this workflow with 1, 2, 4, and 8 processes, I arrive at the following timings:

Using 1 process:

      real    0m32.720s
      user    0m33.900s
      sys 0m0.540s
      

2 processes:

      real    0m16.528s
      user    0m17.068s
      sys 0m0.300s
      

4 processes:

      real    0m8.441s
      user    0m8.644s
      sys 0m0.140s
      

8 processes:

      real    0m5.069s
      user    0m4.520s
      sys 0m0.364s
      

Depending on the available resources, running 8 mongoexport processes in parallel seems to speed up the export by a factor of ~6. This was tested on a machine with 8 cores.
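
If a single output file is needed afterwards, the per-process files can simply be concatenated, since mongoexport emits one JSON document per line by default:

    cat out*.json > APIRecords.json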

Note: halfer's answer is similar in idea, although this answer basically tries to see whether there is any benefit to calling mongoexport in parallel.
