Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS


Problem description

I'm trying to parse and insert a big csv file into MongoDB, but once the file exceeds 100'000 rows I get a bad response from the server. The files I need to insert are usually 200'000 rows or more.

I've tried both bulk insert (insertMany) and a Babyparse (Papaparse) streaming approach to insert the file row by row, but with poor results.

Node API:

router.post('/csv-upload/:id', multipartMiddleware, function(req, res) {

    // Post variables
    var fileId = req.params.id;
    var csv = req.files.files.path;

    // create a queue object with concurrency 5
    var q = async.queue(function(row, callback) {
        var entry = new Entry(row);
        entry.save();
        callback();
    }, 5);

    baby.parseFiles(csv, {
        header: true, // Includes header in JSON
        skipEmptyLines: true,
        fastMode: true,
        step: function(results, parser) {
            results.data[0].id = fileId;

            q.push(results.data[0], function (err) {
                if (err) { throw err; }
            });
        },
        complete: function(results, file) {
            console.log("Parsing complete:", results, file);
            q.drain = function() {
                console.log('All items have been processed');
                res.send("Completed!");
            };
        }
    });
});

This streaming approach results in: POST SERVER net::ERR_EMPTY_RESPONSE

Not sure if I'm using async.queue correctly, though.

Is there a better and more efficient way to do this, or am I doing something wrong?

Express Server:

// Dependencies
var express = require('express');
var path = require('path');
var bodyParser = require('body-parser');
var routes = require('./server/routes');
var mongoose = require("mongoose");
var babel = require("babel-core/register");
var compression = require('compression');
var PORT = process.env.PORT || 3000;
// Include the cluster module
var cluster = require('cluster');

mongoose.connect(process.env.MONGOLAB_URI || 'mongodb://localhost/routes');

// Code to run if we're in the master process
if (cluster.isMaster) {

    // Count the machine's CPUs
    var cpuCount = require('os').cpus().length;

    // Create a worker for each CPU
    for (var i = 0; i < cpuCount; i += 1) {
        cluster.fork();
    }

// Code to run if we're in a worker process
} else {
    // Express
    var app = express();

    app.use(bodyParser.json({limit: '50mb'}));
    app.use(bodyParser.urlencoded({limit: '50mb', extended: true}));

    // Compress responses
    app.use(compression());

    // Used for production build
    app.use(express.static(path.join(__dirname, 'public')));

    routes(app);

    // Routes
    app.use('/api', require('./server/routes/api'));

    app.all('/*', function(req, res) {
        res.sendFile(path.join(__dirname, 'public/index.html'));
    });

    // Start server
    app.listen(PORT, function() {
        console.log('Server ' + cluster.worker.id + ' running on ' + PORT);
    });
}

Answer

Handling the import:

Great question. From my experience, by far the fastest way to insert a csv into mongo is via the command line:

mongoimport -d db_name -c collection_name --type csv --file file.csv --headerline 

I don't believe mongoose has a way of calling mongoimport (someone correct me if I'm wrong).

But it's simple enough to call via node directly:

var exec = require('child_process').exec;
var cmd = 'mongoimport -d db_name -c collection_name --type csv --file file.csv --headerline';

exec(cmd, function(error, stdout, stderr) {
  // do whatever you need during the callback
});

The above will have to be modified to be dynamic, but it should be self-explanatory.
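For example, here is a minimal sketch of making the call dynamic (the database name, collection name, file path, and the importCsv helper name are all hypothetical, not from the original answer). Using execFile with an argument array instead of building a shell string also avoids quoting issues with user-supplied paths:

var execFile = require('child_process').execFile;

// Hypothetical helper: import a csv file into a collection via mongoimport.
// Assumes the mongoimport binary is installed and on the PATH.
function importCsv(dbName, collectionName, filePath, done) {
    var args = [
        '-d', dbName,
        '-c', collectionName,
        '--type', 'csv',
        '--file', filePath,
        '--headerline'
    ];

    execFile('mongoimport', args, function(error, stdout, stderr) {
        if (error) { return done(error); }
        done(null, stdout);
    });
}

// Usage inside a route handler (names are illustrative):
// importCsv('db_name', 'collection_name', req.files.files.path, function(err) { ... });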

Handling the upload:

Uploading the file from a front-end client is another challenge.

Most browsers will time out if you make a request to a server and don't get a response within 60 seconds (which is probably what you're running into above).

One solution would be to open a socket connection (search for socket.io on npm for details). This creates a persistent connection to the server and isn't subject to the timeout restrictions.
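As a rough illustration only (the event names and the importCsv helper from the sketch above are assumptions, not part of the original answer), the server could keep the socket open and report completion when the import finishes, instead of holding an HTTP request open:

var app = require('express')();
var http = require('http').Server(app);
var io = require('socket.io')(http);

io.on('connection', function(socket) {
    // The client emits 'start-import' with the id of a file it already uploaded.
    socket.on('start-import', function(fileId) {
        // importCsv is the hypothetical mongoimport wrapper sketched earlier.
        importCsv('db_name', 'collection_name', '/uploads/' + fileId + '.csv', function(err) {
            if (err) { return socket.emit('import-error', err.message); }
            // No browser timeout to worry about: the socket stays open until we emit.
            socket.emit('import-complete', fileId);
        });
    });
});

http.listen(3000);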

If the upload itself is not the issue, and the timeout is due to slow parsing/inserting, then you may not have to worry about this once you implement the above.

Other considerations:

I'm not sure exactly what you need to send back to the user, or what parsing needs to take place. But that can either be done outside of the normal request/response cycle, or it can be handled over a socket connection if it's needed during a single request/response cycle.
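For instance, a minimal sketch of the first option, reusing the hypothetical importCsv helper from above: acknowledge the upload immediately so the browser never waits on the import, and notify the client later (for example over the socket) when it finishes:

router.post('/csv-upload/:id', multipartMiddleware, function(req, res) {
    var fileId = req.params.id;
    var csvPath = req.files.files.path;

    // Acknowledge immediately so the request never hits the browser timeout.
    res.status(202).send('Import started');

    // Run the import in the background; importCsv is the sketch from earlier.
    importCsv('db_name', 'collection_name', csvPath, function(err) {
        // Notify the client out-of-band, e.g. io.emit('import-complete', fileId);
        if (err) { console.error('Import failed for', fileId, err); }
        else { console.log('Import finished for', fileId); }
    });
});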
