Parse large JSON file in Nodejs


Problem description

I have a file which stores many JavaScript objects in JSON form and I need to read the file, create each of the objects, and do something with them (insert them into a db in my case). The JavaScript objects can be represented in one of two formats:

Format A:

[{name: 'thing1'},
....
{name: 'thing999999999'}]

or Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... indicates a lot of JSON objects. I am aware I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

However, the file could be really large, so I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?
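For instance (a minimal illustration with made-up sample objects, not from the original question), a chunk boundary can land in the middle of an object, in which case JSON.parse() on the raw chunk throws:

var chunk1 = '{"name":"thing1"}\n{"na'; // first chunk ends mid-object
var chunk2 = 'me":"thing2"}\n';         // the rest only arrives with the next chunk

try {
    JSON.parse(chunk1); // SyntaxError: the chunk is not valid JSON on its own
} catch (e) {
    console.log('cannot parse a raw chunk:', e.message);
}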

Ideally, each object would be read as a separate data chunk, but I am not sure how to do that.

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function(chunk) {

    var pleaseBeAJSObject = JSON.parse(chunk);           
    // insert pleaseBeAJSObject in a database
});
importStream.on('end', function(item) {
   console.log("Woot, imported objects into the database!");
});

Note that I wish to prevent reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak - I need a way that is guaranteed not to cause a memory overload, no matter how many objects are contained in the file.

I can choose to use FormatA or FormatB or maybe something else, just please specify in your answer. Thanks!

Solution

To process a file line-by-line, you simply need to decouple the reading of the file and the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, format B):

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0,pos)); // hand off the line
        buf = buf.slice(pos+1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length-1] == '\r') line=line.substr(0,line.length-1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in a buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices the buffer from the beginning up to the newline and hands it off to processLine. It then checks again whether there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, processLine is called once per input line. If a carriage return is present, it strips it off (to avoid issues with line endings – LF vs CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.
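One thing the code above does not handle is the very end of the stream: if the file's last line has no trailing newline, that final object stays in buf when the stream finishes. A small end handler (a sketch reusing the buf and processLine from above; it is not part of the original answer) flushes it:

stream.on('end', function() {
    if (buf.length > 0) processLine(buf); // flush the last line if the file lacks a trailing newline
    buf = '';
    console.log('done'); // every complete line has been handed to processLine
});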

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.
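A quick illustration of that strictness (sample values are made up):

console.log(JSON.parse('{"name":"thing1"}')); // ok: key and string use double quotes

try {
    JSON.parse("{name:'thing1'}"); // unquoted key, single-quoted value
} catch (e) {
    console.log('rejected:', e.message); // SyntaxError
}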

Because no more than a chunk of data will ever be in memory at a time, this will be extremely memory efficient. It will also be extremely fast. A quick test showed I processed 10,000 rows in under 15ms.
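If you want a rough feel for that number on your own machine, a tiny self-contained test like the following is enough (a sketch, not the author's original benchmark; it only times splitting and JSON.parse over pre-generated lines, not the file I/O):

var lines = [];
for (var i = 0; i < 10000; i++) lines.push(JSON.stringify({ name: 'thing' + i }));
var payload = lines.join('\n');

var start = Date.now();
var count = 0;
payload.split('\n').forEach(function (line) {
    if (line.length > 0) { JSON.parse(line); count++; }
});
console.log('parsed ' + count + ' lines in ' + (Date.now() - start) + ' ms');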
