Bash Script: count unique lines in file
Question
I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:
ip.ad.dre.ss[:port]
Desired result:
There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort which will reduce it to lines of the format
ip.ad.dre.ss[:port] count
where count is the number of occurrences of that specific address (and port). No special work has to be done; treat different ports as different addresses.
So far, I'm using this command to scrape all of the IP addresses from the log file:
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt
From that, I can use a fairly simple regex to scrape out all of the IP addresses that were sent by my address (which I don't care about).
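For instance, a minimal sketch of that filtering step, assuming a hypothetical local address of 10.0.0.1 (substitute your own):
# hypothetical: drop entries sent from my own address, 10.0.0.1
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log \
  | grep -v -E '^10\.0\.0\.1(:|$)' > ips.txt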
I can then use the following to extract the unique entries:
sort -u ips.txt > intermediate.txt
I don't know how I can aggregate the line counts somehow with sort.
Answer
You can use the uniq command to get counts of sorted repeated lines:
sort ips.txt | uniq -c
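With a small hypothetical ips.txt, that output looks like this (uniq -c prefixes each distinct line with the number of times it occurred):
      1 10.0.0.2:22
      3 10.0.0.2:80
      1 10.0.0.3:443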
To get the most frequent results at top (thanks to Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
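Here -b ignores the leading blanks that uniq -c adds, -g compares by general numeric value, and -r reverses the order so the largest counts come first. If you want the count printed after the address instead, matching the ip.ad.dre.ss[:port] count format from the question, one possible single-pass sketch with awk (an alternative, not part of the original answer):
awk '{ count[$0]++ } END { for (line in count) print line, count[line] }' ips.txt | sort -k2 -rn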