"sort filename | uniq" does not work on large files


Problem description

I can remove duplicate entries from small text files, but not from large text files.
I have a file that's 4MB.
The beginning of the file looks like this:

aa
aah
aahed
aahed
aahing
aahing
aahs
aahs
aal
aalii
aalii
aaliis
aaliis
...

I want to remove the duplicates.
For example, "aahed" shows up twice, and I would like it to show up only once.

No matter which one-liner I try, the big list does not change.

If I type: sort big_list.txt | uniq | less
I see:

aa
aah
aahed
aahed   <-- didn't get rid of it
aahing
aahing   <-- didn't get rid of it
aahs
aahs   <-- didn't get rid of it
aal
...

However, if I copy a small chunk of words from the top of this text file and re-run the command on that small chunk of data, it does what's expected.

Are these programs refusing to sort because the file is too big? I didn't think 4MB was very big. It doesn't output a warning or anything.

I quickly wrote my own "uniq" program, and it has the same behavior. It works on a small subset of the list, but doesn't do anything to the 4MB text file. What's my issue?

Here is a hex dump:

00000000  61 61 0a 61 61 68 0a 61  61 68 65 64 0a 61 61 68  |aa.aah.aahed.aah|
00000010  65 64 0d 0a 61 61 68 69  6e 67 0a 61 61 68 69 6e  |ed..aahing.aahin|
00000020  67 0d 0a 61 61 68 73 0a  61 61 68 73 0d 0a 61 61  |g..aahs.aahs..aa|
00000030  6c 0a 61 61 6c 69 69 0a  61 61 6c 69 69 0d 0a 61  |l.aalii.aalii..a|
00000040  61 6c 69 69 73 0a 61 61  6c 69 69 73 0d 0a 61 61  |aliis.aaliis..aa|


61 61 68 65 64 0a
a  a  h  e  d  \n

61 61 68 65 64 0d
a  a  h  e  d  \r

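Solved: the "duplicate" lines use different line delimiters (LF vs. CR+LF), so they are not byte-for-byte identical.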
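
A quick way to confirm this kind of problem without reading a hex dump is to count the lines that end in a carriage return. The following is only a sketch: the sample file is hypothetical, built with printf to mimic the first few entries of the dump, and the CR is held in a variable on the assumption that the script should run in a plain POSIX shell.

```shell
# Hypothetical sample mimicking the hex dump: LF and CR+LF endings mixed.
sample=$(mktemp)
printf 'aa\naah\naahed\naahed\r\naahing\naahing\r\n' > "$sample"

# Hold a literal carriage return in a variable (works in POSIX shells).
cr=$(printf '\r')

# Count the lines whose last character before the newline is a CR.
crlf_lines=$(grep -c "${cr}\$" "$sample")
echo "lines ending in CR: $crlf_lines"

rm -f "$sample"
```

A non-zero count on a file that is supposed to have Unix line endings suggests that some entries carry a trailing CR, which is why sort and uniq treat them as distinct lines.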

Recommended answer

You can normalize the line delimiters (convert CR+LF to LF):

sed 's/\r//' big_list.txt | sort -u
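
To see the effect on a small scale, the sketch below compares the number of unique lines before and after stripping carriage returns. It is an illustration under stated assumptions, not part of the original answer: the sample file is hypothetical, and tr -d '\r' is swapped in for the sed command as a variant that does not depend on the sed implementation understanding \r. Note that tr removes every CR in the file, not only those at line ends.

```shell
# Hypothetical sample: each duplicated word appears once with LF and once with CR+LF.
sample_list=$(mktemp)
printf 'aa\naahed\naahed\r\naahing\naahing\r\n' > "$sample_list"

# Without normalization, "aahed" and "aahed<CR>" count as different lines.
before=$(sort "$sample_list" | uniq | wc -l | tr -d ' ')

# Strip all carriage returns first, then sort and deduplicate in one step.
after=$(tr -d '\r' < "$sample_list" | sort -u | wc -l | tr -d ' ')

echo "unique lines before: $before, after: $after"

rm -f "$sample_list"
```

On this sample the count drops from 5 to 3, matching the behavior described in the question: the duplicates only disappear once both copies end in the same delimiter.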
