How can I count the most frequently occurring sequences of 3 letters within a word with a bash script?

Problem description

Indo Cheap has a sample file like this:

XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant

He needs to find the most frequently occurring sequences of 3 letters within a word.

The output should be:

acc = 5 aco = 3

He asks whether this is possible in bash.

He says: "I have absolutely no idea how I can accomplish it with awk, sed, or grep. Any clue how it's possible..."

Recommended answer

This is absolutely possible with bash, sed and awk, and here is how to do it:

#!/bin/bash

# Lowercase the file "sample" and split it into one word per line
for line in $(tr 'A-Z' 'a-z' < sample | tr -s ' ' '\n'); do   # for each word
  ll=${#line}
  for ((i = 0; i < ll - 2; i++)); do    # for each start position in the word
    printf '%s\n' "${line:i:3}"         # print the sequence of 3 letters
  done
done |
  sort |                                # sort the sequences of three letters
  uniq -c |                             # count the sequences
  sed '/^ *1 /d' |                      # drop sequences that occur only once
  sort -n -r |                          # most frequent sequences first
  awk '{print $2" = "$1}' |             # format output as asked
  tr '\n' ' '                           # put all results on one line
echo                                    # add a new line at the end

The output for the example above is:

cou = 5 acc = 5 unt = 4 tin = 4 oun = 4 nti = 4 ing = 4 cco = 4 aco = 3
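The core of the script is bash substring expansion, `${variable:offset:length}`, used to slide a 3-character window across each word. As a minimal sketch of just that step, the function below prints every 3-letter sequence of a single word. The name `trigrams_of` is hypothetical and is not part of the original answer.

```shell
#!/bin/bash

# Hypothetical helper: print every 3-letter sequence of one word, one per line.
trigrams_of() {
  local word=$1 i
  # The last window starts at index length-3, hence the bound length-2.
  for ((i = 0; i < ${#word} - 2; i++)); do
    printf '%s\n' "${word:i:3}"
  done
}

# Example call: trigrams_of account
# prints acc, cco, cou, oun, unt (one per line)
```

Words shorter than three characters produce no output, because the loop condition is false from the start.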

If a different output format is wanted, the script can easily be adapted as needed.
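As one example of such an adaptation, the sketch below prints only the N most frequent sequences, one per line, as tab-separated sequence and count. The function name `top_trigrams`, its argument, reading from standard input, and the tie-breaking rule (alphabetical order among equal counts) are assumptions made for illustration and are not part of the original answer. Unlike the script above, it does not drop sequences that occur only once.

```shell
#!/bin/bash

# Hypothetical helper: read text on stdin and print the N most frequent
# 3-letter sequences (default N = 3), one "sequence<TAB>count" pair per line.
top_trigrams() {
  local n=${1:-3} word i
  tr 'A-Z' 'a-z' | tr -s ' \t' '\n' |             # lowercase, one word per line
  while IFS= read -r word || [[ -n $word ]]; do   # also handle a missing final newline
    for ((i = 0; i < ${#word} - 2; i++)); do
      printf '%s\n' "${word:i:3}"                 # emit each 3-letter window
    done
  done |
    LC_ALL=C sort | uniq -c |                     # count identical sequences
    LC_ALL=C sort -k1,1nr -k2,2 |                 # count descending, then alphabetical
    head -n "$n" |                                # keep the first N lines
    awk '{printf "%s\t%d\n", $2, $1}'             # print sequence first, then count
}

# Example call: top_trigrams 3 < sample
```

Setting `LC_ALL=C` for the `sort` calls is a design choice to keep the ordering of ties independent of the user's locale.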
