"排序文件名| uniq"不适用于大文件 [英] "sort filename | uniq" does not work on large files
Problem description
I can remove duplicate entries from small text files, but not large text files.
I have a file that's 4MB.
The beginning of the file looks like this:
aa
aah
aahed
aahed
aahing
aahing
aahs
aahs
aal
aalii
aalii
aaliis
aaliis
...
I want to remove the duplicates.
For example, "aahed" shows up twice, and I would only like it to show up once.
No matter what one-liner I've tried, the big list will not change.
If I type:
sort big_list.txt | uniq | less
I see:
aa
aah
aahed
aahed <-- didn't get rid of it
aahing
aahing <-- didn't get rid of it
aahs
aahs <-- didn't get rid of it
aal
...
However, if I copy a small chunk of words from the top of this text file and re-run the command on the small chunk of data, it does what's expected.
Are these programs refusing to sort because the file is too big? I didn't think 4MB was very big. It doesn't output a warning or anything.
I quickly wrote my own "uniq" program, and it has the same behavior. It works on a small subset of the list, but doesn't do anything to the 4MB text file. What's my issue?
Here is a hex dump:
00000000 61 61 0a 61 61 68 0a 61 61 68 65 64 0a 61 61 68 |aa.aah.aahed.aah|
00000010 65 64 0d 0a 61 61 68 69 6e 67 0a 61 61 68 69 6e |ed..aahing.aahin|
00000020 67 0d 0a 61 61 68 73 0a 61 61 68 73 0d 0a 61 61 |g..aahs.aahs..aa|
00000030 6c 0a 61 61 6c 69 69 0a 61 61 6c 69 69 0d 0a 61 |l.aalii.aalii..a|
00000040 61 6c 69 69 73 0a 61 61 6c 69 69 73 0d 0a 61 61 |aliis.aaliis..aa|
61 61 68 65 64 0a
a a h e d \n
61 61 68 65 64 0d
a a h e d \r
Solved: different line delimiters
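One way to confirm this on a file is to count the lines that contain a carriage return. The following is a minimal sketch, assuming a POSIX shell; `sample_list.txt` and its contents are hypothetical stand-ins for the real list:

```shell
# Hypothetical sample: two lines end in LF, two end in CR+LF.
tmpdir=$(mktemp -d)
printf 'aahed\naahed\r\naahs\naahs\r\n' > "$tmpdir/sample_list.txt"

# Hold a literal carriage return in a variable (portable alternative to $'\r').
cr=$(printf '\r')

# Count the lines that contain a carriage return.
crlf_lines=$(grep -c "$cr" "$tmpdir/sample_list.txt")
echo "lines with CR: $crlf_lines"

rm -rf "$tmpdir"
```

A non-zero count means some lines carry a trailing CR, so they compare as different from otherwise identical LF-only lines.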
Recommended answer
You can normalize line delimiters (convert CR+LF to LF):
sed 's/\r//' big_list.txt | sort -u
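The `\r` escape in a sed expression is not guaranteed by POSIX, so on systems without GNU sed the same idea can be sketched with `tr`. The file name `sample_list.txt` and its contents below are made up for illustration, and `LC_ALL=C` is set only as an assumption to force byte-wise comparison so the counts are reproducible:

```shell
# Force byte-wise comparison so sort/uniq behave the same everywhere.
export LC_ALL=C

# Hypothetical sample that mixes LF and CR+LF line endings.
tmpdir=$(mktemp -d)
printf 'aa\naahed\naahed\r\naahing\naahing\r\n' > "$tmpdir/sample_list.txt"

# Without normalization, "aahed" and "aahed\r" are distinct lines to uniq.
before=$(sort "$tmpdir/sample_list.txt" | uniq | awk 'END {print NR}')

# Delete every carriage return, then sort and deduplicate in one step.
deduped=$(tr -d '\r' < "$tmpdir/sample_list.txt" | sort -u)
after=$(printf '%s\n' "$deduped" | awk 'END {print NR}')

echo "unique lines before: $before, after: $after"
rm -rf "$tmpdir"
```

Note that `tr -d '\r'` removes every carriage return in the file, not only those at line ends, which is fine for a word list like this one.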