Calculate Word occurrences from file in bash


Problem description

I'm sorry for the very noob question, but I'm kind of new to bash programming (started a few days ago). Basically what I want to do is keep one file with all the word occurrences of another file.

I know I can do this:

sort | uniq -c | sort
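
For a single file, combined with the word splitting I already use in my script below, that would look something like this (just a sketch, with words.txt standing in for any input file):

tr -cs "A-Za-z'" '\n' < words.txt | tr A-Z a-z | sort | uniq -c | sort -rn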

The thing is that after that I want to take a second file, calculate the occurrences again, and update the first one. Then I take a third file, and so on.

What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.

I'm pretty sure there is a very efficient way to do this with just a command or so, using uniq, but I can't figure it out.

Could you please point me in the right direction?

I'm also pasting the code I wrote:

#!/bin/bash
#   counts the number of word occurrences in a file and writes them to another file #
#   the words are listed from the most frequent to the least frequent              #

touch .check                # used to check the occurrences. Temporary file
touch distribution.txt      # final file with all the occurrences calculated

page=$1             # contains the file I'm calculating
occurrences=$2          # temporary file for the occurrences

# takes all the words from the file $page and orders them by occurrences
cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check

# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
    word=${words}       # word I'm calculating
    strlen=${#word}     # word's length
    # I use a blacklist to skip banned words (for example very short or uninfluential words, like articles and prepositions)
    if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
    then
        # if the word was never found before it writes it with 1 occurrence
        if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
        then
            echo "$word: 1" | cat >> $occurrences
        # else it calculates the occurrences
        else
            old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
            let "new=old+1"
            sed -i "s/^$word: $old$/$word: $new/g" $occurrences
        fi
    fi
done

rm .check

# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt

Solution

Well, I'm not sure I've got the point of what you are trying to do, but I would do it this way:

while read file
do
  cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list 
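
The same loop also works with "$file" quoted and the input redirected instead of piped through cat; this is only a minor robustness tweak (file names with spaces), not a change in behaviour:

while read -r file
do
  tr -cs "A-Za-z'" '\n' < "$file" | tr A-Z a-z | sort | uniq -c > "stat.$file"
done < file-list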

Now you have statistics for all your files, and now you simply aggregate them:

while read file
do
  cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'
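
For comparison, the same aggregation can be written with an awk associative array; this is just an equivalent sketch, not part of the original answer, but it drops the need for the sort -k2 step and the prev/s bookkeeping (and the near-empty line that version prints for its very first record):

while read file
do
  cat stat.$file
done < file-list \
| awk '{ count[$2] += $1 } END { for (w in count) print count[w], w }' \
| sort -rn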

Example of usage:

$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF

$ while read file; do
> cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list

$ while read file
> do
>   cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head

3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell
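
The original script also filters against a .blacklist and a minimum word length; one way to bolt that onto this approach is to filter the word stream before counting. A sketch, assuming .blacklist holds one lowercase word per line and $page is the input file as in the question's script:

tr -cs "A-Za-z'" '\n' < $page | tr A-Z a-z \
| awk 'length($0) > 2' \
| grep -vFx -f .blacklist \
| sort | uniq -c | sort -rn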
