Importing a very large record set into MongoDB using nodejs


Problem description

Before I dive into my question, I wanted to point out that I am doing this partially to get familiarized with node and mongo. I realize there are probably better ways to accomplish my final goal, but what I want to get out of this is a general methodology that might apply to other situations.

The goal:

I have a csv file containing 6+ million geo-ip records. Each record contains 4 fields in total and the file is roughly 180mb.

I want to process this file and insert each record into a MongoDB collection called "Blocks". Each "Block" will have the 4 fields from the csv file.

My current approach

I am using mongoose to create a "Block" model and a ReadStream to process the file line by line. The code I'm using to process the file and extract the records works and I can make it print each record to the console if I want to.
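
The file-reading code itself isn't shown in the question; a minimal sketch of what such a setup might look like is below. The schema fields, connection string, file name, the naive comma-splitting, and the insertRecord wrapper are assumptions for illustration, not the poster's actual code.

var fs       = require('fs');
var readline = require('readline');
var mongoose = require('mongoose');

mongoose.connect('mongodb://localhost/geoip');   // assumed connection string

// "Block" model with the fields used later in the question
var Block = mongoose.model('Block', new mongoose.Schema({
    ipFrom:   Number,
    ipTo:     Number,
    location: Number
    // the CSV has a fourth field, but its name isn't given in the question
}));

var rl = readline.createInterface({ input: fs.createReadStream('geoip_blocks.csv') });

rl.on('line', function(line){
    var parts = line.split(',');   // naive CSV parsing; assumes no quoted commas
    var rec = { startipnum: parts[0], endipnum: parts[1], locid: parts[2] };
    insertRecord(rec);             // hypothetical wrapper around the save code shown further below
});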

For each record in the file, it calls a function that creates a new Blocks object (using mongoose), populates the fields and saves it.

Here is the code inside that function, which is called each time a line is read and parsed. The "rec" variable contains an object representing a single record from the file.

var block = new Block();

block.ipFrom    = rec.startipnum;
block.ipTo      = rec.endipnum;
block.location  = rec.locid;

connections++;  // count of save() calls currently in flight

block.save(function(err){

    if(err) throw err;
    //console.log('.');
    records_inserted++;

    // when the last pending save finishes, close the connection and report
    if( --connections == 0 ){
        mongoose.disconnect();
        console.log( records_inserted + ' records inserted' );
    }

});

The problem

Since the file is being read asynchronously, more than one line is processed at the same time, and reading the file is much faster than MongoDB can write, so the whole process stalls at around 282000 records and climbs to 5k+ concurrent Mongo connections. It doesn't crash... it just sits there doing nothing and doesn't seem to recover, nor does the item count in the mongo collection go up any further.

What I'm after here is a general approach to solving this problem. How would I cap the number of concurrent Mongo connections? I would like to take advantage of being able to insert multiple records at the same time, but I'm missing a way to regulate the flow.
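
For illustration (this is an assumption, not part of the original post): one common way to regulate the flow is to pause the line reader once a fixed number of saves are in flight and resume it when they drain. The sketch below reworks the 'line' handler from the earlier snippet; MAX_PENDING is an arbitrary cap.

var MAX_PENDING = 100;     // arbitrary cap on in-flight save() calls
var pending     = 0;
var readingDone = false;

rl.on('line', function(line){
    var parts = line.split(',');
    var rec = { startipnum: parts[0], endipnum: parts[1], locid: parts[2] };

    pending++;
    // pause() stops further reads; a few already-buffered lines may still arrive,
    // so the cap is approximate rather than exact
    if (pending >= MAX_PENDING) rl.pause();

    var block = new Block();
    block.ipFrom   = rec.startipnum;
    block.ipTo     = rec.endipnum;
    block.location = rec.locid;

    block.save(function(err){
        if (err) throw err;
        pending--;
        if (pending < MAX_PENDING) rl.resume();
        if (readingDone && pending === 0) mongoose.disconnect();
    });
});

rl.on('close', function(){
    readingDone = true;
    if (pending === 0) mongoose.disconnect();
});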

Thanks.

Recommended answer

I would try the command-line CSV import option from MongoDB - it should do what you are after without having to write any code.
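
That tool is mongoimport; a sketch of an invocation (the database, collection and file names are placeholders, and --headerline assumes the CSV's first row contains column names):

mongoimport --db geoip --collection blocks --type csv --headerline --file geo_blocks.csv

If the file has no header row, the --fields option can be used to list the column names instead.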
