Fast shell command to remove stop words in a text file
Question
I have a 2GB text file and am trying to remove frequently occurring English stop words from it.
I have a stopwords.txt file whose contents look like this:
a
an
the
for
and
I
What is a fast way to do this using a shell command such as tr, sed, or awk?
Recommended answer
Here's a method using the command line and perl:
Save the text below as replacesw.sh:
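#!/bin/bash
# Usage: ./replacesw.sh stopwords.txt textfile
# Join the stop words with "|" and wrap them in word boundaries: \b(a|an|the|...)\b
MYREGEX="\\b($(perl -pe 's/\n/|/g' "$1"))\\b"
perl -pe "s/$MYREGEX//g" "$2"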
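To make the pattern that the script hands to perl easier to inspect, here is a minimal sketch. The function name build_stopword_regex is hypothetical and not part of the original answer; it joins the lines with paste instead of perl, which avoids a trailing "|", and it assumes the stop words contain no regex metacharacters and the list has no blank lines.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): print the alternation
# pattern built from a stop word list with one word per line.
# Assumption: words contain no regex metacharacters and there are no blank lines.
build_stopword_regex() {
  # paste -s joins all lines of the file; -d'|' uses "|" as the separator
  printf '\\b(%s)\\b' "$(paste -sd'|' "$1")"
}
```

With the stopwords.txt shown in the question, build_stopword_regex stopwords.txt would print \b(a|an|the|for|and|I)\b.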
Then, if you have saved the stop word list above as stopwords.txt and have a second file called (for example) testtext.txt that contains:
This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.
Then the following at the command line will remove the stop words:
$ ./replacesw.sh stopwords.txt testtext.txt
This is file with stopwords from stopwords.txt testing.
More than one line in file, better test.
You may need to run chmod u+x replacesw.sh first.
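Since the question also mentions awk, here is a minimal sketch of an alternative that looks each token up in an awk array instead of matching one large regex. It is not from the original answer and has not been benchmarked on a 2GB file. The function name remove_stopwords_awk is hypothetical. It assumes tokens are separated by whitespace, so a stop word with punctuation attached (such as "the,") would be kept, and runs of whitespace are collapsed to single spaces.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer).
# Usage: remove_stopwords_awk stopwords.txt textfile
# Assumptions: one stop word per line, non-empty stop word file,
# whitespace-separated tokens, case-sensitive matching.
remove_stopwords_awk() {
  awk '
    NR == FNR { stop[$0] = 1; next }     # first file: load the stop words
    {
      out = ""
      for (i = 1; i <= NF; i++)          # second file: keep tokens not in the set
        if (!($i in stop))
          out = (out == "" ? $i : out " " $i)
      print out
    }
  ' "$1" "$2"
}
```

It would be called the same way as the script, for example remove_stopwords_awk stopwords.txt testtext.txt.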