How can I count the most frequently occurring sequence of 3 letters within a word with a bash script?


Question

I have a sample file:

XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant

Here I need to find the most frequently occurring sequences of 3 letters within a word.

The output should be:

acc = 5 aco = 3

Is that possible in Bash?

I have absolutely no idea how to accomplish this with awk, sed, or grep.

Any clue how this could be done?

PS: I haven't included any attempt because I have no idea how to do this, and I don't want to write a pointless awk -F, xyz abc... that isn't going to help anyone.

Recommended answer

Here's how to get started with what I THINK you're trying to do:

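$ cat tst.awk
# Count every 3-character substring (case-insensitive) of each
# whitespace-separated field.
BEGIN { stringLgth = 3 }    # length of the sequences to count
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        # fields shorter than stringLgth contain no such sequence
        if ( fieldLgth >= stringLgth ) {
            # last position at which a full-length sequence can start
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    # "for (... in ...)" visits keys in unspecified order, so sort afterwards
    for (string in cnt) {
        print string, cnt[string]
    }
}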


$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
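
If only the top few sequences are wanted, in the `acc = 5` style from the question, the pipeline above can be trimmed and reformatted. This is a minimal sketch, not part of the original answer: the function name `top_trigrams` and its arguments are hypothetical, the sequence length is hard-coded to 3, and ties are broken alphabetically, which is an assumption the question does not specify.

```shell
# Hypothetical helper (not from the original answer): print the N most
# frequent 3-letter sequences in a file as "seq = count".
#   $1 = input file
#   $2 = number of rows to keep (assumed default: 3)
top_trigrams() {
    awk '
    {
        # walk every whitespace-separated word on the line
        for (i = 1; i <= NF; i++) {
            word = tolower($i)
            # slide a 3-character window over the word;
            # words shorter than 3 characters skip the loop entirely
            for (pos = 1; pos <= length(word) - 2; pos++)
                cnt[substr(word, pos, 3)]++
        }
    }
    END { for (s in cnt) print s, cnt[s] }
    ' "$1" |
    # count descending, then sequence ascending to make ties deterministic
    LC_ALL=C sort -k2,2nr -k1,1 |
    head -n "${2:-3}" |
    awk '{ print $1 " = " $2 }'
}
```

With the sample file, `top_trigrams file 3` prints `acc = 5`, `cou = 5` and `cco = 4`, one per line. The question's expected output skips `cou` and the count-4 sequences in favour of `aco = 3`; the rule behind that is not stated, so this sketch does not try to reproduce it.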
