Sorting a data stream before writing to file in Node.js
Problem description
I have an input file that may contain up to 1M records, and each record looks like this:
field 1 field 2 field3 \n
I want to read this input file and sort it based on field3 before writing it to another file.
Here is what I have so far:
var fs = require('fs'),
    readline = require('readline'),
    stream = require('stream');
var start = Date.now();
var outstream = new stream;
outstream.readable = true;
outstream.writable = true;
var rl = readline.createInterface({
    input: fs.createReadStream('cross.txt'),
    output: outstream,
    terminal: false
});
rl.on('line', function(line) {
    //var tmp = line.split("\t").reverse().join('\t') + '\n';
    //fs.appendFileSync("op_rev.txt", tmp );
    // this logic to reverse and then sort is too slow
});
rl.on('close', function() {
    var closetime = Date.now();
    console.log('Read entire file. ', (closetime - start)/1000, ' secs');
});
I am basically stuck at this point. All I have is the ability to read from one file and write to another. Is there a way to efficiently sort this data before writing it?
Recommended answer
DB and sort-stream are fine solutions, but a DB might be overkill, and I think sort-stream eventually just sorts the entire file in an in-memory array (in the through end callback), so I think performance will be roughly the same as with the original solution.
(But I haven't run any benchmarks, so I might be wrong.)
So, just for the hack of it, I'll throw in another solution :)
I was curious to see how big a difference this would make, so I ran some benchmarks.
The results surprised even me. It turns out the sort -k3,3 solution is better by far, about 10 times faster than the original solution (a simple array sort), while the nedb and sort-stream solutions are at least 18 times slower than the original solution (i.e. at least 180 times slower than sort -k3,3).
(See the benchmark results below.)
If you are on a *nix machine (Unix, Linux, Mac, ...), you can simply use
sort -k 3,3 yourInputFile > op_rev.txt
and let the OS do the sorting for you.
You'll probably get better performance, since sorting is done natively.
Or, if you want to process the sorted output in Node:
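var spawn = require('child_process').spawn,
    readline = require('readline'),
    sort = spawn('sort', ['-k3,3', './test.tsv']);
// stdout arrives in arbitrary chunks that may end mid-line,
// so let readline reassemble complete lines before splitting fields
var rl = readline.createInterface({ input: sort.stdout, terminal: false });
rl.on('line', function (line) {
    // process one sorted record
    var record = line.split('\t');
    console.info(`Record: ${record}`);
});
// 'close' fires after the process has ended and its stdio streams are closed
sort.on('close', function (code) {
    if (code) {
        // handle error
    }
    console.log('Done');
});
// optional
sort.stderr.on('data', function (data) {
    // handle error...
    console.log('stderr: ' + data);
});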
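If the sorted data only needs to end up in a file, the same command can also be launched from Node without streaming anything through JavaScript. The following is a minimal sketch under some assumptions that are not in the original answer: the helper name sortFileWithSort is hypothetical, fields are assumed to be tab-separated (hence -t with a tab character), a POSIX-style sort is assumed to be on PATH, and LC_ALL=C is set to request byte-order comparison regardless of the user's locale.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: sort inputPath by its third field into outputPath
// without passing the data through Node.
// Assumptions: a POSIX-style `sort` is on PATH and fields are tab-separated.
function sortFileWithSort(inputPath, outputPath) {
  // -t sets the field separator, -k3,3 sorts on field 3 only, -o names the output file
  execFileSync('sort', ['-t', '\t', '-k3,3', '-o', outputPath, inputPath], {
    // LC_ALL=C requests plain byte-order comparison, independent of locale
    env: Object.assign({}, process.env, { LC_ALL: 'C' }),
  });
}

// Usage (file names are placeholders):
// sortFileWithSort('./test.tsv', './op_rev.txt');
```

execFileSync throws if sort exits with a non-zero status, so a try/catch around the call is enough for basic error handling. Because it blocks the event loop until sort finishes, this form suits one-off scripts better than servers.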
Hope this helps :)
Edit: adding some benchmark details.
Here are the results (running on a MacBook Pro):
- sort1 uses a straightforward approach, sorting the records in an in-memory array.
  Avg time: 35.6s (baseline)
- sort2 uses sort-stream, as suggested by Joe Krill.
  Avg time: 11.1m (about 18.7 times slower)
  (I wonder why. I didn't dig in.)
- sort3 uses nedb, as suggested by Tamas Hegedus.
  Time: about 16m (about 27 times slower)
- sort4 only sorts, by executing sort -k 3,3 input.txt > out4.txt in a terminal.
  Avg time: 1.2s (about 30 times faster)
- sort5 uses sort -k3,3 and processes the response sent to stdout.
  Avg time: 3.65s (about 9.7 times faster)
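The code for the sort1 baseline is not included in the answer. A minimal sketch of such an in-memory array sort might look like the following. The function names are hypothetical, and the sketch assumes tab-separated fields with field 3 compared as a plain string, which may differ from what was actually benchmarked.

```javascript
const fs = require('fs');

// Sort newline-separated records by their third tab-separated field.
// Assumption: fields are tab-separated and field 3 compares as plain text.
function sortByThirdField(text) {
  const keyed = text
    .split('\n')
    .filter(line => line.length > 0) // drop empty lines, e.g. after the final newline
    .map(line => ({ line, key: line.split('\t')[2] || '' })); // extract each key once
  keyed.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return keyed.length ? keyed.map(entry => entry.line).join('\n') + '\n' : '';
}

// Hypothetical helper: read the whole file, sort it, and write the result.
function sortFileInMemory(inputPath, outputPath) {
  const text = fs.readFileSync(inputPath, 'utf8');
  fs.writeFileSync(outputPath, sortByThirdField(text));
}

// Usage (file names are placeholders):
// sortFileInMemory('cross.txt', 'op_rev.txt');
```

Extracting the key once per record, instead of splitting inside the comparator, avoids re-splitting each line on every comparison. The whole file still has to fit in memory, which is the trade-off this approach shares with sort-stream as described above.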