How can I remove stop words from a sentence using a shell script?


Problem description

I'm trying to remove stop words from the sentences in a file.

The stop words I mean are:
[I, a, an, as, at, the, by, in, for, of, on, that]

I have these sentences in the file my_text.txt:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

Then I want to remove the stop words from the sentence above.

I used this script:

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
cat $p  | sed -e 's/\<$i\>//g' 
done < my_text.txt

But the output is:

One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program

The expected output should be:

One primary goals design Unix system was to create an environment promoted efficient program

Note: I want to remove stop words, not duplicated words.

Recommended answer

Like this, assuming $p is an existing file:

 sed -i -e "s/\<$i\>//g" "$p"

You have to use double quotes, not single quotes, to get variables expanded.
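
As a minimal sketch of the quoting difference, the snippet below runs the same substitution twice on a sample line. The variable names (line, single, double) are made up for illustration, and it assumes GNU sed, since the \< and \> word-boundary anchors are GNU extensions.

```shell
#!/usr/bin/env bash
# Illustrative only: variable names are hypothetical; assumes GNU sed.
i="the"
line="in the design of the Unix system"

# Single quotes: the shell passes the literal text $i to sed,
# so the pattern never matches the word "the".
single=$(printf '%s\n' "$line" | sed -e 's/\<$i\>//g')

# Double quotes: the shell expands $i to "the" before sed runs.
double=$(printf '%s\n' "$line" | sed -e "s/\<$i\>//g")

printf 'single quotes: %s\n' "$single"
printf 'double quotes: %s\n' "$double"
```

With single quotes the line comes back unchanged, which matches the output reported in the question.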

The -i switch edits the file in place.

Learn how to quote properly in shell; it's very important:

"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words

Finally:

array=( I a an as at the by in for of on that  )
for i in "${array[@]}"
do
    sed -i -e "s/\<$i\>\s*//g" Input_File 
done
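
The loop above rewrites the file once per stop word. As a possible variation (not part of the original answer), the words can be joined into one alternation so that sed makes a single pass and writes to standard output instead of editing in place. This is a sketch that assumes bash and GNU sed; the names stop_words, pattern, remove_stop_words and result are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a single-pass variant; assumes bash arrays and GNU sed
# (\<, \> and \s are GNU extensions). All names here are illustrative.
stop_words=( I a an as at the by in for of on that )

# Join the array with "|" to build one alternation: I|a|an|...|that
pattern=$(IFS='|'; printf '%s' "${stop_words[*]}")

# Read text on stdin, write it to stdout without the stop words.
remove_stop_words() {
    sed -E "s/\<(${pattern})\>\s*//g"
}

result=$(printf '%s\n' "One of the primary goals in the design of the Unix system" \
    | remove_stop_words)
printf '%s\n' "$result"
```

To process a file without modifying it, the function can be fed with a redirect, for example remove_stop_words < my_text.txt.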

Bonus

Try it without \s* to understand why I added it to the regex.
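
For readers who want to check their result, here is a small comparison on a fragment of the sample sentence. It is only a sketch, assuming GNU sed; the variable names (line, without, with) are invented for the example. The values are wrapped in brackets when printed so the spacing is visible.

```shell
#!/usr/bin/env bash
# Illustrative comparison; assumes GNU sed. Names are hypothetical.
line="One of the primary goals"

# Without \s*: each removed word leaves its surrounding spaces behind.
without=$(printf '%s\n' "$line" | sed -e 's/\<of\>//g' -e 's/\<the\>//g')

# With \s*: the whitespace following each removed word is deleted too.
with=$(printf '%s\n' "$line" | sed -e 's/\<of\>\s*//g' -e 's/\<the\>\s*//g')

printf 'without: [%s]\n' "$without"
printf 'with:    [%s]\n' "$with"
```

Removing only the word leaves a run of spaces where the stop words used to be, while the \s* version keeps a single space between the remaining words.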
