How can I count the most frequently occurring sequences of 3 letters within a word with a bash script?
Problem description
Indo Cheap has a sample file like this:
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
He needs to get the most frequently occurring sequences of 3 letters within a word.
The output should be:
acc = 5 aco = 3
He asks whether that is possible in bash.
He says: "I have absolutely no idea how I can accomplish it with awk, sed, or grep. Any clue how it's possible..."
Recommended answer
This is absolutely possible with bash, sed and awk, and here is how to do it:
#!/bin/bash
for line in $(cat sample | tr 'A-Z' 'a-z' | tr -s ' ' '\n'); do   # for each lowercased word
    ll=${#line}
    for ((i = 0; i < ll - 2; i++)); do
        echo "${line:i:3}"          # print all sequences of 3 letters
    done
done |
sort |                              # sort the sequences of three letters
uniq -c |                           # count the sequences
sed '/^ *1 /d' |                    # filter out the non-repeated sequences
sort -n -r |                        # most frequent sequences first
awk -F ' ' '{print $2" = "$1}' |    # format output as asked
tr '\n' ' '                         # put all results on one line
echo                                # add a new line at the end
The output for the sample above is:
cou = 5 acc = 5 unt = 4 tin = 4 oun = 4 nti = 4 ing = 4 cco = 4 aco = 3
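As a possible variant (not part of the original answer), the counting itself can be moved into awk so that bash no longer loops over every character position. The sketch below assumes a hypothetical helper named count_trigrams that takes the file name as its first argument; like the script above, it splits words on whitespace only and drops sequences that occur once.

```shell
#!/bin/bash
# Hypothetical helper (name and interface are assumptions, not from the answer):
# count every 3-letter sequence inside each whitespace-separated word of a file.
# Usage: count_trigrams sample
count_trigrams() {
    tr 'A-Z' 'a-z' < "$1" |
    awk '{
        for (w = 1; w <= NF; w++)                   # each word on the line
            for (i = 1; i <= length($w) - 2; i++)   # each start position
                count[substr($w, i, 3)]++
    }
    END {
        for (s in count)
            if (count[s] > 1)                       # keep repeated sequences only
                print count[s], s
    }' |
    LC_ALL=C sort -k1,1nr -k2,2r |                  # by count, then reverse alphabetical
    awk '{ printf "%s%s = %s", sep, $2, $1; sep = " " }
         END { if (NR) print "" }'                  # join on one line, end with newline
}
```

Called on the sample file, this should list the same sequences in the same order as the script above, without the trailing space before the final newline.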
If another output format is wanted, the script can easily be adapted as needed.
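For instance, one such adaptation might print only the N most frequent sequences, one per line, with a tab between the sequence and its count. This is a sketch under assumptions: the helper name top_trigrams, its arguments, and the default of 3 are all hypothetical, and ties are broken alphabetically rather than in the order shown above.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer):
# print the N most frequent 3-letter sequences of FILE, one per line.
# Usage: top_trigrams FILE [N]    (N defaults to 3)
top_trigrams() {
    local file=$1 n=${2:-3}
    tr 'A-Z' 'a-z' < "$file" |
    tr -s ' \t' '\n' |                              # one word per line
    while read -r word || [[ -n $word ]]; do        # also handle a missing final newline
        for ((i = 0; i < ${#word} - 2; i++)); do
            printf '%s\n' "${word:i:3}"             # every 3-letter sequence in the word
        done
    done |
    LC_ALL=C sort |
    uniq -c |                                       # count identical sequences
    LC_ALL=C sort -k1,1nr -k2,2 |                   # by count, ties alphabetically
    head -n "$n" |
    awk '{ printf "%s\t%d\n", $2, $1 }'             # "sequence<TAB>count"
}
```

Unlike the original script, this version keeps sequences that occur only once, so a small file still yields up to N lines.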