Parsing huge logfiles in Node.js - read in line-by-line


Problem Description

I need to do some parsing of large (5-10 GB) logfiles in Javascript/Node.js (I'm using Cube).

The log lines look something like this:

10:00:43.343423 I'm a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".

We need to read each line, do some parsing (e.g. strip out 5, 7 and SUCCESS), then pump this data into Cube (https://github.com/square/cube) using their JS client.

Firstly, what is the canonical way in Node to read in a file, line by line?

It seems to be a fairly common question online:

A lot of the answers seem to point to a bunch of third-party modules:

However, this seems like a fairly basic task - surely, there's a simple way within the stdlib to read in a textfile, line-by-line?
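
For reference, a minimal sketch of the stdlib-only approach being asked about, using Node's built-in readline module on top of a read stream (the filename is just a placeholder):

var fs = require('fs');
var readline = require('readline');

var rl = readline.createInterface({
    input: fs.createReadStream('very-large-file.log')
});

// 'line' fires once per line, without the trailing newline
rl.on('line', function (line) {
    console.log('Line:', line);
});

// 'close' fires once the underlying stream has been fully consumed
rl.on('close', function () {
    console.log('Done reading file.');
});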

Secondly, I then need to process each line (e.g. convert the timestamp into a Date object, and extract useful fields).

What's the best way to do this, maximising throughput? Is there some way that won't block on either reading in each line, or on sending it to Cube?
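
As a rough sketch of the kind of per-line work described here, each line could be parsed into an event object and the sends batched, so neither reading nor sending blocks on any single line. Note that sendBatchToCube below is a hypothetical stand-in, not Cube's actual client API, and the date prefix is assumed since the log lines only carry a time of day:

// Sketch only: parse each line into an event and batch the sends.
var batch = [];
var BATCH_SIZE = 1000;

function handleLine(line) {
    var spaceAt = line.indexOf(' ');
    var timeOfDay = line.slice(0, spaceAt);            // e.g. "10:00:43.343423"
    batch.push({
        // assumed date: the sample lines carry only a time of day
        time: new Date('2013-01-01T' + timeOfDay.slice(0, 12) + 'Z'),
        message: line.slice(spaceAt + 1)
    });
    if (batch.length >= BATCH_SIZE) {
        flush();
    }
}

function flush() {
    var events = batch;
    batch = [];
    sendBatchToCube(events, function (err) {
        if (err) console.error('Failed to send batch of ' + events.length + ' events', err);
    });
}

// hypothetical async sender, only here so the sketch runs on its own
function sendBatchToCube(events, callback) {
    setImmediate(function () { callback(null); });
}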

Thirdly - I'm guessing using string splits, and the JS equivalent of contains (indexOf != -1?) will be a lot faster than regexes? Has anybody had much experience in parsing massive amounts of text data in Node.js?
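
For the sample line above, the two approaches look roughly like this (a sketch comparing the code shape only; which one is actually faster would need to be measured):

var line = '10:00:43.343423 I\'m a friendly log message. ' +
    'There are 5 cats, and 7 dogs. We are in state "SUCCESS".';

// String-method version: indexOf/split, no regex engine involved
var isSuccess = line.indexOf('"SUCCESS"') !== -1;
var words = line.split(' ');
var cats = parseInt(words[words.indexOf('cats,') - 1], 10);   // 5
var dogs = parseInt(words[words.indexOf('dogs.') - 1], 10);   // 7

// Regex version: one pattern capturing both counts and the state
var m = /There are (\d+) cats, and (\d+) dogs\. We are in state "(\w+)"/.exec(line);
// m[1] === '5', m[2] === '7', m[3] === 'SUCCESS'

console.log(isSuccess, cats, dogs, m && m.slice(1));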

Cheers, Victor

Recommended Answer

I searched for a solution to parse very large files (GBs) line by line using a stream. None of the third-party libraries and examples I found suited my needs, since they either did not process the file line by line or read the entire file into memory.

The following solution can parse very large files, line by line, using stream & pipe. For testing I used a 2.1 GB file with 17,000,000 records; RAM usage did not exceed 60 MB.

First, install the event-stream package:

npm install event-stream

Then:

var fs = require('fs')
    , es = require('event-stream');

var lineNr = 0;

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line){

        // pause the readstream
        s.pause();

        lineNr += 1;

        // process line here and call s.resume() when rdy
        // function below was for logging memory usage
        logMemoryUsage(lineNr);

        // resume the readstream, possibly from a callback
        s.resume();
    })
    .on('error', function(err){
        console.log('Error while reading file.', err);
    })
    .on('end', function(){
        console.log('Read entire file.')
    })
);
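
// logMemoryUsage is referenced but not defined above; a minimal assumed
// implementation, just so the snippet runs as-is:
function logMemoryUsage(lineNr) {
    console.log('line ' + lineNr + ': heap used ' +
        Math.round(process.memoryUsage().heapUsed / 1024 / 1024) + ' MB');
}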

Please let me know how it goes!

