Fast shell command to remove stop words in a text file

Views: 106  Published: 2020/5/18 1:02:30  Tags: shell nlp text-processing

Question

I have a 2 GB text file, and I am trying to remove frequently occurring English stop words from it.

I have a file, stopwords.txt, with contents like this:

a
an
the
for
and
I

What is a fast way to do this using a shell command such as tr, sed, or awk?

Recommended answer

Here's a method using the command line and perl:

Save the text below as replacesw.sh:

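#!/bin/bash
# Usage: ./replacesw.sh stopwords.txt input.txt
# Join the stop words with "|", dropping trailing newlines so the pattern has no empty alternative.
WORDS=$(perl -0777 -pe 's/\n+$//; s/\n+/|/g' "$1")
# Delete every stop word that appears as a whole word (\b marks a word boundary).
perl -pe "s/\\b($WORDS)\\b//g" "$2"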

Then, if you have saved the stop word list above as stopwords.txt and have a second file (e.g. testtext.txt) that contains:

This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.

Then the following at the command line will remove the stop words:

KBs-MBP13:temp kbenoit$ ./replacesw.sh stopwords.txt testtext.txt 
This is  file with  stopwords from  stopwords.txt  testing.
More than one line in  file,   better test.

You may need to run chmod u+x replacesw.sh first.
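
The output above keeps the spaces that surrounded each removed word, so runs of two or three spaces remain. If that matters, one possible follow-up (not part of the original answer) is to squeeze repeated spaces with `tr -s`. The variable names `line` and `squeezed` below are hypothetical.

```shell
#!/bin/bash
# Sample line copied from the output above, with gaps left by the removed words.
line='This is  file with  stopwords from  stopwords.txt  testing.'

# tr -s ' ' squeezes each run of spaces into a single space.
squeezed=$(printf '%s\n' "$line" | tr -s ' ')
printf '%s\n' "$squeezed"
```

In practice the same step could be attached to the script with a pipe, for example ./replacesw.sh stopwords.txt testtext.txt | tr -s ' '. A stop word removed at the very start of a line would still leave one leading space.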
