Parse large JSON file in Nodejs

Problem Description

I have a file which stores many JavaScript objects in JSON form, and I need to read the file, create each of the objects, and do something with them (insert them into a db, in my case). The JavaScript objects can be represented in a format like:

Format A:

[{name: 'thing1'},
....
{name: 'thing999999999'}]

Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... indicates a lot of JSON objects. I am aware that I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

However, the file could be really large, so I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

Ideally, each object would be read as a separate data chunk, but I am not sure how to do that.

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function(chunk) {

    var pleaseBeAJSObject = JSON.parse(chunk);           
    // insert pleaseBeAJSObject in a database
});
importStream.on('end', function(item) {
   console.log("Woot, imported objects into the database!");
});
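
(To see why this naive approach fails, here is a minimal sketch with made-up chunk contents: a chunk boundary can land in the middle of an object, so a raw chunk is not necessarily valid JSON on its own.)

// Illustrative only: a stream may split the data anywhere,
// e.g. right in the middle of an object.
var chunk1 = '{"name": "thing1"}\n{"na';
var chunk2 = 'me": "thing2"}\n';

try {
    JSON.parse(chunk1); // throws: a raw chunk is not one complete JSON value
} catch (e) {
    console.log('cannot parse a raw chunk:', e.message);
}

// Only after re-assembling and splitting on newlines do we get parseable pieces:
(chunk1 + chunk2).split('\n').filter(Boolean).forEach(function (line) {
    console.log(JSON.parse(line)); // { name: 'thing1' }, { name: 'thing2' }
});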

Note, I wish to prevent reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak - I need a way that is guaranteed not to cause a memory overload, no matter how many objects are contained in the file.

I can choose to use Format A or Format B or maybe something else, just please specify in your answer. Thanks!

Recommended Answer

To process a file line-by-line, you simply need to decouple the reading of the file from the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, Format B):

var fs = require('fs');

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0, pos)); // hand off the line
        buf = buf.slice(pos + 1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length - 1] == '\r') line = line.substr(0, line.length - 1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in the buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices the buffer from the beginning to the newline and hands it off to processLine. It then checks again whether there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.
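
As a minimal sketch of that behaviour (the chunk contents below are made up, and pump/processLine are assumed to be defined as above), you can feed the buffer two chunks whose boundary falls inside an object:

// Illustrative only: simulate the 'data' handler by appending chunks to buf.
buf = '';

buf += '{"name": "thing1"}\n{"na'; // first chunk ends mid-object
pump(); // logs { name: 'thing1' }; the partial second object stays in buf

buf += 'me": "thing2"}\n'; // second chunk completes it
pump(); // logs { name: 'thing2' }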

Finally, processLine is called once per input line. If present, it strips off the carriage return character (to avoid issues with line endings: LF vs. CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.
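
A quick illustration of the difference:

JSON.parse('{"name":"thing1"}'); // ok: returns { name: 'thing1' }

try {
    JSON.parse("{name:'thing1'}"); // unquoted key, single-quoted value
} catch (e) {
    console.log(e instanceof SyntaxError); // true: JSON.parse rejects it
}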

Because no more than a chunk of data will ever be in memory at a time, this will be extremely memory efficient. It will also be extremely fast: a quick test showed it processing 10,000 rows in under 15 ms.
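
If you want to try this end-to-end, here is a rough smoke-test sketch (the file name objects.ndjson and the generated sample data are made up for illustration) that writes a small Format B file and then lets the code above stream through it:

// Hypothetical smoke test for the snippet above.
var fs = require('fs');
var filePath = 'objects.ndjson'; // made-up file name

// write 10,000 newline-delimited JSON objects (Format B)
var lines = [];
for (var i = 1; i <= 10000; i++) {
    lines.push(JSON.stringify({ name: 'thing' + i }));
}
fs.writeFileSync(filePath, lines.join('\n') + '\n');

// ...then run the stream/pump/processLine code from the answer with this filePath;
// each object is handed to processLine one line at a time, no matter how the
// chunks are split, and console.log can be replaced with your database insert.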
