How can I remove stop words from a sentence using a shell script?
Question
I'm trying to remove stop words from the sentences in a file.
The stop words I mean are:
[I, a, an, as, at, the, by, in, for, of, on, that]
I have this sentence in the file my_text.txt:
One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program
Then I want to remove the stop words from the sentence above.
I used this script:
array=( I a an as at the by in for of on that )
for i in "${array[@]}"
do
cat $p | sed -e 's/\<$i\>//g'
done < my_text.txt
But the output is:
One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program
The expected output should be:
One primary goals design Unix system was to create an environment promoted efficient program
Note: I want to remove stop words, not duplicated words.
Recommended answer
Like this, assuming $p is an existing file:
sed -i -e "s/\<$i\>//g" "$p"
You have to use double quotes, not single quotes, to get variables expanded.
The -i switch edits the file in place.
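To see the difference the quotes make, here is a minimal sketch that runs the same substitution both ways on a shell variable instead of a file. It assumes GNU sed (the \< and \> word-boundary escapes are GNU extensions); the names line, single, and double are made up for this illustration.

```shell
# Assumption: GNU sed, which supports \< and \> as word boundaries.
i="the"
line="in the design of the Unix system"

# Single quotes: the shell passes the characters $i to sed unchanged,
# so the pattern never matches the word "the".
single=$(printf '%s\n' "$line" | sed -e 's/\<$i\>//g')

# Double quotes: the shell expands $i to "the" before sed sees the script.
double=$(printf '%s\n' "$line" | sed -e "s/\<$i\>//g")

printf 'single quotes: %s\n' "$single"
printf 'double quotes: %s\n' "$double"
```

The single-quoted run prints the line unchanged, which is the behaviour reported in the question.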
Learn how to quote properly in shell; it's very important:
"Double quote" every literal that contains spaces/metacharacters and every expansion: "$var", "$(command "$var")", "${array[@]}", "a & b". Use 'single quotes' for code or literal $'s: 'Costs $5 US', ssh host 'echo "$HOSTNAME"'. See
http://mywiki.wooledge.org/Quotes
http://mywiki.wooledge.org/Arguments
http://wiki.bash-hackers.org/syntax/words
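As a small illustration of why every expansion should be quoted, the sketch below counts how many arguments the shell produces from the same variable with and without quotes. It assumes the default IFS; the value "my file.txt" and the names var, unquoted, and quoted are hypothetical.

```shell
# Hypothetical value containing a space.
var="my file.txt"

# Unquoted: the expansion undergoes word splitting, giving two arguments.
set -- $var
unquoted=$#

# Quoted: the expansion stays a single argument.
set -- "$var"
quoted=$#

printf 'unquoted: %s argument(s)\n' "$unquoted"
printf 'quoted:   %s argument(s)\n' "$quoted"
```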
Finally:
array=( I a an as at the by in for of on that )
for i in "${array[@]}"
do
sed -i -e "s/\<$i\>\s*//g" Input_File
done
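The loop above runs sed once per stop word and rewrites the file each time. One possible alternative, not part of the original answer, is to join the words into a single alternation and make one pass. This is only a sketch: it assumes GNU sed (for -E together with \<, \>, and \s), keeps the words in a plain string rather than an array, and works on a variable instead of Input_File. The names stop_words, pattern, sentence, and result are made up here.

```shell
# Assumption: GNU sed. The stop words are kept in a space-separated string.
stop_words='I a an as at the by in for of on that'

# Turn the spaces into | so the words form one alternation: I|a|an|...
pattern=$(printf '%s' "$stop_words" | tr ' ' '|')

sentence='One of the primary goals in the design of the Unix system was to create an environment that promoted efficient program'

# One pass: delete any whole-word match plus the whitespace that follows it.
result=$(printf '%s\n' "$sentence" | sed -E "s/\<($pattern)\>\s*//g")

printf '%s\n' "$result"
```

Since "an" is in the stop list, it is removed from "create an environment" as well.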
Bonus
Try it without \s* to understand why I added this regex.
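For anyone who wants to try that comparison quickly, here is a minimal sketch that removes a single stop word with and without the trailing \s*. It assumes GNU sed (\s is a GNU extension); the names line, without_ws, and with_ws are illustrative only.

```shell
# Assumption: GNU sed, which understands \s as "whitespace".
line="One of the primary goals"

# Without \s*: only the word is deleted, so the spaces on both sides remain.
without_ws=$(printf '%s\n' "$line" | sed -e "s/\<of\>//g")

# With \s*: the whitespace after the word is deleted along with it.
with_ws=$(printf '%s\n' "$line" | sed -e "s/\<of\>\s*//g")

printf 'without \\s*: [%s]\n' "$without_ws"
printf 'with \\s*:    [%s]\n' "$with_ws"
```

The first result has a doubled space where the stop word used to be; the second does not.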