Sort across multiple files in Linux

Question

I have multiple (many) files; each very large:

file0.txt
file1.txt
file2.txt

I do not want to join them into a single file because the result would be 10+ GB. Each line in each file contains a 40-byte string. The strings are already fairly well ordered: in roughly 1 step in 10, the value decreases instead of increasing.

I would like the lines sorted across all the files (in place, if possible). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.

I am working on Linux and am fairly new to it. I know about the sort command for a single file, but I am wondering whether there is a way to sort across multiple files, or a way to make a pseudo-file out of smaller files that Linux will treat as a single file.

What I know I can do: sort each file individually, then read into file1.txt to find the value larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join them, and sort again. But this is a pain, and it assumes no values from file2.txt belong in file0.txt (although that is highly unlikely in my case).

To be clear, if the files look like this:

f0.txt
DDD
XXX
AAA

f1.txt
BBB
FFF
CCC

f2.txt
EEE
YYY
ZZZ

I want this:

f0.txt
AAA
BBB
CCC

f1.txt
DDD
EEE
FFF

f2.txt
XXX
YYY
ZZZ

Recommended answer

I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:

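for file in *.txt; do
    # Sort each file in place; quoting "$file" keeps names with spaces intact.
    sort -o "$file" "$file"
done
# Merge the already-sorted files and split the stream into 1,000,000-line chunks.
sort -m *.txt | split -d -l 1000000 - output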

  • The sort in the for loop makes sure the content of each input file is sorted. If you don't want to overwrite the original, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c "$file" || exit 1)
  • The second sort merges the input files efficiently while keeping the output sorted.
  • This is piped to the split command, which then writes numerically suffixed output files (output00, output01, ...). Notice the - character; it tells split to read from standard input (i.e. the pipe) instead of a file.
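
To make the pipeline concrete, here is a small sketch that runs it on the three example files from the question. The scratch directory, the `sorted_` output prefix, and the 3-line chunk size are assumptions chosen for the demo; setting `LC_ALL=C` is also an addition, made so that every sort invocation uses the same byte-wise ordering.

```shell
# Illustrative only: recreate the question's sample files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' DDD XXX AAA > f0.txt
printf '%s\n' BBB FFF CCC > f1.txt
printf '%s\n' EEE YYY ZZZ > f2.txt

# Assumption: byte-wise comparison, so each sort call orders lines identically.
export LC_ALL=C

# Step 1: sort each file on its own.
for file in f*.txt; do
    sort -o "$file" "$file"
done

# Step 2: merge the sorted files and cut the stream into 3-line chunks.
# "sorted_" is a hypothetical prefix; split -d appends 00, 01, 02, ...
sort -m f*.txt | split -d -l 3 - sorted_

# Print each output chunk with its file name.
head sorted_*
```

The result is written to new files rather than in place, so the originals stay untouched until you decide to rename the chunks over them.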

Also, here's a short summary of how the merge sort works:

  1. sort reads a line from each file.
  2. It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
  3. Repeat step 2 until there are no more lines in any file.
  4. At this point, the output should be a perfectly sorted file.
  5. Profit!
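
The steps above can be sketched as a toy k-way merge written in awk. This is only an illustration of the idea, not how sort is actually implemented, and `merge_sorted` plus the demo file names are hypothetical. It assumes every input file is already sorted and compares lines as plain strings.

```shell
# Toy re-implementation of the merge steps; in practice use `sort -m`.
merge_sorted() {
    awk 'BEGIN {
        n = ARGC - 1
        # Step 1: read the first line of every input file.
        for (i = 1; i <= n; i++)
            alive[i] = ((getline head[i] < ARGV[i]) > 0)
        while (1) {
            # Step 2: pick the smallest of the current lines.
            # Appending "" forces a string (not numeric) comparison.
            best = 0
            for (i = 1; i <= n; i++)
                if (alive[i] && (best == 0 || (head[i] "") < (head[best] "")))
                    best = i
            # Step 3: stop once every file is exhausted.
            if (best == 0) break
            print head[best]
            # Refill from the file that supplied the line just printed.
            alive[best] = ((getline head[best] < ARGV[best]) > 0)
        }
    }' "$@"
}

# Demo on three small, already-sorted files (hypothetical names).
demo=$(mktemp -d)
printf '%s\n' AAA DDD XXX > "$demo/a.txt"
printf '%s\n' BBB CCC FFF > "$demo/b.txt"
printf '%s\n' EEE YYY ZZZ > "$demo/c.txt"
merged=$(merge_sorted "$demo/a.txt" "$demo/b.txt" "$demo/c.txt")
printf '%s\n' "$merged"
```

Only one line per file is held in memory at a time, which is why merging pre-sorted files scales to inputs far larger than RAM.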
