How to load 100 million records into MongoDB with Scala for performance testing?


Question

I have a small script, written in Scala, that is intended to load a MongoDB instance with 100,000,000 sample records. The idea is to get the DB fully loaded and then do some performance testing (and tune/re-load if necessary).

The problem is that the load time per 100,000 records increases pretty linearly. At the beginning of my load process it took only 4 seconds to load those records. Now, at nearly 6,000,000 records, it's taking between 300 and 400 seconds to load the same amount (100,000)! That's two orders of magnitude slower! Queries are still snappy, but at this rate I'll never be able to load the amount of data that I'd like.

Will this work faster if I write a file out with all of my records (all 100,000,000!) and then use mongoimport to import the whole thing? Or are my expectations too high, and am I using the DB beyond what it's supposed to handle?

Any ideas? Thanks!

Here's my script:

import java.util.Date

import com.mongodb.casbah.Imports._
import com.mongodb.casbah.commons.MongoDBObject

object MongoPopulateTest {
  val ONE_HUNDRED_THOUSAND = 100000
  val ONE_MILLION          = ONE_HUNDRED_THOUSAND * 10

  val random     = new scala.util.Random(12345)
  val connection = MongoConnection()
  val db         = connection("mongoVolumeTest")
  val collection = db("testData")

  val INDEX_KEYS = List("A", "G", "E", "F")

  def main(args: Array[String]) {
    populateCoacs(ONE_MILLION * 100)
  }

  def populateCoacs(count: Int) {
    println("Creating indexes: " + INDEX_KEYS.mkString(", "))
    INDEX_KEYS.foreach(key => collection.ensureIndex(MongoDBObject(key -> 1)))

    println("Adding " + count + " records to DB.")

    val start     = (new Date()).getTime()
    var lastBatch = start

    for(i <- 0 until count) {
      collection.save(makeCoac())
      if(i % 100000 == 0 && i != 0) {
        println(i + ": " + (((new Date()).getTime() - lastBatch) / 1000.0) + " seconds (" +  (new Date()).toString() + ")")
        lastBatch = (new Date()).getTime()
      }
    }

    val elapsedSeconds = ((new Date).getTime() - start) / 1000

    println("Done. " + count + " COAC rows inserted in " + elapsedSeconds + " seconds.")
  }

  def makeCoac(): MongoDBObject = {
    MongoDBObject(
      "A" -> random.nextPrintableChar().toString(),
      "B" -> scala.math.abs(random.nextInt()),
      "C" -> makeRandomPrintableString(50),
      "D" -> (if(random.nextBoolean()) { "Cd" } else { "Cc" }),
      "E" -> makeRandomPrintableString(15),
      "F" -> makeRandomPrintableString(15),
      "G" -> scala.math.abs(random.nextInt()),
      "H" -> random.nextBoolean(),
      "I" -> (if(random.nextBoolean()) { 41 } else { 31 }),
      "J" -> (if(random.nextBoolean()) { "A" } else { "B" }),
      "K" -> random.nextFloat(),
      "L" -> makeRandomPrintableString(15),
      "M" -> makeRandomPrintableString(15),
      "N" -> scala.math.abs(random.nextInt()),
      "O" -> random.nextFloat(),
      "P" -> (if(random.nextBoolean()) { "USD" } else { "GBP" }),
      "Q" -> (if(random.nextBoolean()) { "PROCESSED" } else { "UNPROCESSED" }),
      "R" -> scala.math.abs(random.nextInt())
    )
  }

  def makeRandomPrintableString(length: Int): String = {
    var result = ""
    for(i <- 0 until length) {
      result += random.nextPrintableChar().toString()
    }
    result
  }
}

Here's the output from my script:

Creating indexes: A, G, E, F
Adding 100000000 records to DB.
100000: 4.456 seconds (Thu Jul 21 15:18:57 EDT 2011)
200000: 4.155 seconds (Thu Jul 21 15:19:01 EDT 2011)
300000: 4.284 seconds (Thu Jul 21 15:19:05 EDT 2011)
400000: 4.32 seconds (Thu Jul 21 15:19:10 EDT 2011)
500000: 4.597 seconds (Thu Jul 21 15:19:14 EDT 2011)
600000: 4.412 seconds (Thu Jul 21 15:19:19 EDT 2011)
700000: 4.435 seconds (Thu Jul 21 15:19:23 EDT 2011)
800000: 5.919 seconds (Thu Jul 21 15:19:29 EDT 2011)
900000: 4.517 seconds (Thu Jul 21 15:19:33 EDT 2011)
1000000: 4.483 seconds (Thu Jul 21 15:19:38 EDT 2011)
1100000: 4.78 seconds (Thu Jul 21 15:19:43 EDT 2011)
1200000: 9.643 seconds (Thu Jul 21 15:19:52 EDT 2011)
1300000: 25.479 seconds (Thu Jul 21 15:20:18 EDT 2011)
1400000: 30.028 seconds (Thu Jul 21 15:20:48 EDT 2011)
1500000: 24.531 seconds (Thu Jul 21 15:21:12 EDT 2011)
1600000: 18.562 seconds (Thu Jul 21 15:21:31 EDT 2011)
1700000: 28.48 seconds (Thu Jul 21 15:21:59 EDT 2011)
1800000: 29.127 seconds (Thu Jul 21 15:22:29 EDT 2011)
1900000: 25.814 seconds (Thu Jul 21 15:22:54 EDT 2011)
2000000: 16.658 seconds (Thu Jul 21 15:23:11 EDT 2011)
2100000: 24.564 seconds (Thu Jul 21 15:23:36 EDT 2011)
2200000: 32.542 seconds (Thu Jul 21 15:24:08 EDT 2011)
2300000: 30.378 seconds (Thu Jul 21 15:24:39 EDT 2011)
2400000: 21.188 seconds (Thu Jul 21 15:25:00 EDT 2011)
2500000: 23.923 seconds (Thu Jul 21 15:25:24 EDT 2011)
2600000: 46.077 seconds (Thu Jul 21 15:26:10 EDT 2011)
2700000: 104.434 seconds (Thu Jul 21 15:27:54 EDT 2011)
2800000: 23.344 seconds (Thu Jul 21 15:28:17 EDT 2011)
2900000: 17.206 seconds (Thu Jul 21 15:28:35 EDT 2011)
3000000: 19.15 seconds (Thu Jul 21 15:28:54 EDT 2011)
3100000: 14.488 seconds (Thu Jul 21 15:29:08 EDT 2011)
3200000: 20.916 seconds (Thu Jul 21 15:29:29 EDT 2011)
3300000: 69.93 seconds (Thu Jul 21 15:30:39 EDT 2011)
3400000: 81.178 seconds (Thu Jul 21 15:32:00 EDT 2011)
3500000: 93.058 seconds (Thu Jul 21 15:33:33 EDT 2011)
3600000: 168.613 seconds (Thu Jul 21 15:36:22 EDT 2011)
3700000: 189.917 seconds (Thu Jul 21 15:39:32 EDT 2011)
3800000: 200.971 seconds (Thu Jul 21 15:42:53 EDT 2011)
3900000: 207.728 seconds (Thu Jul 21 15:46:21 EDT 2011)
4000000: 213.778 seconds (Thu Jul 21 15:49:54 EDT 2011)
4100000: 219.32 seconds (Thu Jul 21 15:53:34 EDT 2011)
4200000: 241.545 seconds (Thu Jul 21 15:57:35 EDT 2011)
4300000: 193.555 seconds (Thu Jul 21 16:00:49 EDT 2011)
4400000: 190.949 seconds (Thu Jul 21 16:04:00 EDT 2011)
4500000: 184.433 seconds (Thu Jul 21 16:07:04 EDT 2011)
4600000: 231.709 seconds (Thu Jul 21 16:10:56 EDT 2011)
4700000: 243.0 seconds (Thu Jul 21 16:14:59 EDT 2011)
4800000: 310.156 seconds (Thu Jul 21 16:20:09 EDT 2011)
4900000: 318.421 seconds (Thu Jul 21 16:25:28 EDT 2011)
5000000: 378.112 seconds (Thu Jul 21 16:31:46 EDT 2011)
5100000: 265.648 seconds (Thu Jul 21 16:36:11 EDT 2011)
5200000: 295.086 seconds (Thu Jul 21 16:41:06 EDT 2011)
5300000: 297.678 seconds (Thu Jul 21 16:46:04 EDT 2011)
5400000: 329.256 seconds (Thu Jul 21 16:51:33 EDT 2011)
5500000: 336.571 seconds (Thu Jul 21 16:57:10 EDT 2011)
5600000: 398.64 seconds (Thu Jul 21 17:03:49 EDT 2011)
5700000: 351.158 seconds (Thu Jul 21 17:09:40 EDT 2011)
5800000: 410.561 seconds (Thu Jul 21 17:16:30 EDT 2011)
5900000: 689.369 seconds (Thu Jul 21 17:28:00 EDT 2011)

Recommended answer

Some tips:

  1. Do not index your collection before inserting. Every insert also has to update each index, which is overhead. Insert everything, then create the indexes.

  2. Instead of "save", use MongoDB's batch insert, which can insert many records in one operation. Insert around 5,000 documents per batch. You will see a remarkable performance gain.

     See method #2 of insert here; it takes an array of documents to insert instead of a single document. Also see the discussion in this thread.
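The two tips can be sketched together as follows. This is only a minimal illustration, not the asker's actual fix: `loadInBatches`, `makeDoc` and `insertBatch` are hypothetical names, and the real database call is passed in as a function so the batching logic can be run without a MongoDB instance. With Casbah, `insertBatch` would presumably wrap the collection's multi-document insert; check the exact signature for your driver version.

```scala
object BatchLoad {
  /** Generates `count` documents lazily and hands them to `insertBatch`
    * in groups of at most `batchSize`. Returns the number of batches sent.
    * Only one batch is held in memory at a time. */
  def loadInBatches[A](count: Int, batchSize: Int)(makeDoc: () => A)(insertBatch: Seq[A] => Unit): Int = {
    var batches = 0
    Iterator.fill(count)(makeDoc()).grouped(batchSize).foreach { batch =>
      insertBatch(batch)
      batches += 1
    }
    batches
  }

  def main(args: Array[String]): Unit = {
    // Stand-in sink that only records batch sizes; replace it with a real
    // batch insert against the collection (assumption: your driver has one).
    val sizes = scala.collection.mutable.ArrayBuffer.empty[Int]
    val n = loadInBatches(12000, 5000)(() => Map("A" -> "x")) { batch =>
      sizes += batch.size
      ()
    }
    println(s"$n batches with sizes ${sizes.mkString(", ")}")
    // Tip 1: create the indexes only here, after the load has finished,
    // e.g. by running the script's ensureIndex loop at this point.
  }
}
```

In the original script this would mean moving the index-creation loop from the top of `populateCoacs` to the end, and replacing the per-document `collection.save(makeCoac())` with one call per batch.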

And if you want to benchmark more:

This is just a guess, but try using a capped collection of a predefined large size to store all your data. A capped collection without an index has very good insertion performance.
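One thing to watch with this idea: a capped collection has a fixed size in bytes, and once it is full the oldest documents are overwritten, so the size has to be chosen large enough for the whole data set. A rough sizing sketch follows. The helper name `cappedSizeBytes`, the 400-byte average document size and the 25% headroom are all assumptions for illustration, not measurements; measure a sample of your own documents first.

```scala
object CappedSizing {
  /** Bytes to reserve so that `docCount` documents of roughly `avgDocBytes`
    * each fit, multiplied by a safety factor `headroom` (1.25 = 25% extra). */
  def cappedSizeBytes(docCount: Long, avgDocBytes: Long, headroom: Double): Long =
    math.ceil(docCount * avgDocBytes * headroom).toLong

  def main(args: Array[String]): Unit = {
    // 400 bytes per document is a placeholder value, not a measurement.
    val size = cappedSizeBytes(100000000L, 400L, 1.25)
    // "capped" and "size" are the option names MongoDB uses when creating
    // a capped collection; the driver call itself is omitted here.
    val options: Map[String, Any] = Map("capped" -> true, "size" -> size)
    println(options)
  }
}
```

The resulting options would then be passed to whatever collection-creation call your driver provides.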
