Sorting a data stream before writing to file in Node.js

Question

I have an input file that may contain up to 1M records, and each record looks like this:

field 1 field 2 field3 \n

I want to read this input file and sort it by field3 before writing it to another file.

Here is what I have so far:

var fs = require('fs'),
    readline = require('readline'),
    stream = require('stream');

var start = Date.now();

var outstream = new stream;
outstream.readable = true;
outstream.writable = true;

var rl = readline.createInterface({
    input: fs.createReadStream('cross.txt'),
    output: outstream,
    terminal: false
});

rl.on('line', function(line) {
    //var tmp = line.split("\t").reverse().join('\t') + '\n';
    //fs.appendFileSync("op_rev.txt", tmp );
    // this logic to reverse and then sort is too slow
});

rl.on('close', function() {
    var closetime = Date.now();
    console.log('Read entirefile. ', (closetime - start)/1000, ' secs');
});

I am basically stuck at this point. All I have is the ability to read from one file and write to another. Is there a way to efficiently sort this data before writing it?

Recommended answer

A DB and sort-stream are fine solutions, but a DB might be overkill, and I think sort-stream ultimately just sorts the entire file in an in-memory array (in the through end callback), so I expect its performance to be roughly the same as the original solution.
(But I haven't run any benchmarks, so I might be wrong.)

So, just for the hack of it, I'll throw in another solution :)

I was curious to see how big a difference this would make, so I ran some benchmarks.

The results surprised even me: the sort -k3,3 solution is better by far, about 10 times faster than the original solution (a simple array sort), while the nedb and sort-stream solutions are at least 18 times slower than the original solution (i.e. at least 180 times slower than sort -k3,3).

(See the benchmark results below.)

If you are on a *nix machine (Unix, Linux, Mac, ...) you can simply use
sort -k 3,3 yourInputFile > op_rev.txt and let the OS do the sorting for you.
You'll probably get better performance, since the sorting is done natively.
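If the sort has to be started from a Node script rather than a terminal, the same redirect can be reproduced without a shell. The sketch below is only an illustration: the helper name sortFileByThirdField is hypothetical, and it assumes a sort binary is available on the PATH. It opens the output file itself and hands the file descriptor to the child process as its stdout, so the sorted data never passes through Node.

```javascript
const fs = require('fs');
const { spawnSync } = require('child_process');

// Hypothetical helper: roughly `sort -k3,3 inputPath > outputPath`, run without a shell.
// Assumes a `sort` binary is on the PATH.
function sortFileByThirdField(inputPath, outputPath) {
    const outFd = fs.openSync(outputPath, 'w');
    try {
        const result = spawnSync('sort', ['-k3,3', inputPath], {
            // stdin is ignored, stdout goes straight to the file, stderr is captured
            stdio: ['ignore', outFd, 'pipe']
        });
        if (result.error) {
            throw result.error; // e.g. the `sort` binary could not be started
        }
        if (result.status !== 0) {
            throw new Error('sort failed: ' + result.stderr.toString());
        }
    } finally {
        fs.closeSync(outFd);
    }
}
```

A call such as sortFileByThirdField('yourInputFile', 'op_rev.txt') would then mirror the terminal command above. spawnSync blocks the event loop until sort finishes, which is fine for a one-off script but probably not for a server.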

Or, if you want to process the sorted output in Node:

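var spawn = require('child_process').spawn,
    readline = require('readline'),
    sort = spawn('sort', ['-k3,3', './test.tsv']);

// Read stdout line by line: a raw 'data' chunk can end in the middle of a record,
// so splitting each chunk on '\n' would occasionally yield broken records.
var rl = readline.createInterface({ input: sort.stdout });

rl.on('line', function (line) {
    // process one sorted record
    var record = line.split('\t');
    console.info(`Record: ${record}`);
});

// 'close' is emitted after the process has ended and its stdio streams are closed
sort.on('close', function (code) {
    if (code) {
        // handle error (non-zero exit code)
    }

    console.log('Done');
});

// optional
sort.stderr.on('data', function (data) {
    // handle error...
    console.log('stderr: ' + data);
});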

Hope this helps :)

Edit: Adding some benchmark details.

Here are the results (running on a MacBook Pro):

  • sort1 uses a straightforward approach, sorting the records in an in-memory array.
    Avg time: 35.6s (baseline)

  • sort2 uses sort-stream, as suggested by Joe Krill.
    Avg time: 11.1m (about 18.7 times slower)
    (I wonder why. I didn't dig in.)

  • sort3 uses nedb, as suggested by Tamas Hegedus.
    Time: about 16m (about 27 times slower)

  • sort4 only sorts, by executing sort -k 3,3 input.txt > out4.txt in a terminal.
    Avg time: 1.2s (about 30 times faster)

  • sort5 uses sort -k3,3 and processes the output sent to stdout.
    Avg time: 3.65s (about 9.7 times faster)
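The code behind sort1 is not shown in the answer. As a rough idea of what such an in-memory baseline might look like, here is a minimal sketch; it is not the benchmarked code. The helper names sortLinesByThirdField and sortFile are hypothetical, and it assumes tab-separated fields (as in the question's commented-out split("\t")) and plain string comparison on the third field.

```javascript
const fs = require('fs');

// Hypothetical helper: orders raw lines by their third tab-separated field.
// Each line is split once up front, so the comparator only compares cached keys.
function sortLinesByThirdField(lines) {
    return lines
        .map(line => ({ line: line, key: line.split('\t')[2] || '' }))
        .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
        .map(entry => entry.line);
}

// Hypothetical wrapper: read the whole file, sort in memory, write the result once.
function sortFile(inputPath, outputPath) {
    const lines = fs.readFileSync(inputPath, 'utf8')
        .split('\n')
        .filter(line => line.length > 0); // drop the empty entry after the final '\n'
    fs.writeFileSync(outputPath, sortLinesByThirdField(lines).join('\n') + '\n');
}
```

Writing the output in a single call, rather than calling fs.appendFileSync once per line as in the question's commented-out code, avoids opening and closing the output file for every record. The comparison is by string value; numeric keys would need a numeric comparator instead.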
