Finding and Listing Duplicate Words in a Plain Text file
Problem Description
I have a rather large file that I am trying to make sense of. I generated a list of my entire directory structure, which contains a lot of files, using the du -ah command. The result basically lists all the folders under a specific folder, and the files inside each folder, in plain text format.
For example:
4.0G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_003.R3D
3.1G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_004.R3D
15G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC
Is there any command I can run, or utility I can use, that will help me identify whether there is more than one record of the same filename (usually the last 16 characters of each line, plus the extension) and, if such duplicate entries exist, write out the entire path (the full line) to a different text file, so I can find and move the duplicate files off my NAS using a script or something.
Please let me know as this is incredibly stressful to do when the plaintext file itself is 5.2Mb :)
Answer
Split each line on /, take the last item (cut cannot extract the last field directly, so reverse each line with rev and take the first field), then sort and run uniq with -d, which shows only the duplicates.
rev FILE | cut -f1 -d/ | rev | sort | uniq -d
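The pipeline above prints only the duplicated filenames themselves, not the full paths the question asks for. One way to also get the entire lines (size plus complete path) is a two-pass awk scan over the saved du output; this is a sketch, and the file names du.txt and duplicates.txt are assumptions, not from the original answer:

```shell
# Two-pass scan of the saved "du -ah" output (du.txt is an assumed name).
# Pass 1 (NR==FNR) counts how often each filename (the last "/"-separated
# path component, $NF) occurs; pass 2 prints every full line whose
# filename occurred more than once.
awk -F'/' 'NR==FNR { count[$NF]++; next } count[$NF] > 1' du.txt du.txt > duplicates.txt
```

Each line of duplicates.txt then still contains the size and the full path, so a follow-up script can use it to locate and move the duplicate files off the NAS.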