Removing stopwords from a list of text files


Question

I have a list of processed text files, that looks somewhat like this:

text = "this is the first text document " this is the second text document " this is the third document "

I've been able to successfully tokenize the sentences:

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)
for ii, sentence in enumerate(sentences):
    sentences[ii] = remove_punctuation(sentence)
sentence_tokens = [word_tokenize(sentence) for sentence in sentences]
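Note the shape of the result: `sentence_tokens` is a list of lists, one token list per sentence. A minimal sketch of that structure, using `str.split` as a stand-in for `word_tokenize` so it runs without NLTK's tokenizer data:

```python
# str.split stands in for word_tokenize here, so this snippet runs
# without downloading NLTK's tokenizer models.
sentences = ["this is the first text document",
             "this is the second text document"]
sentence_tokens = [sentence.split() for sentence in sentences]

print(sentence_tokens[0])
# ['this', 'is', 'the', 'first', 'text', 'document']
```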

And now I would like to remove stopwords from this list of tokens.
However, because it's a list of sentences within a list of text documents, I can't seem to figure out how to do this.

This is what I've tried so far, but it returns no results:

sentence_tokens_no_stopwords = [w for w in sentence_tokens if w not in stopwords]

I'm assuming achieving this will require some sort of for loop, but what I have now isn't working. Any help would be appreciated!

Answer

The flat comprehension fails because each element of `sentence_tokens` is itself a list of tokens, so `w` is a whole sentence rather than a single word. Filter at both levels with a nested list comprehension:

sentence_tokens_no_stopwords = [[w for w in s if w not in stopwords] for s in sentence_tokens]
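Putting it together, here is a minimal runnable sketch; the tokenized sentences and the stopword set are hard-coded (the set is a stand-in for NLTK's `stopwords.words('english')`) so the example needs no downloaded corpora:

```python
# Stand-in for the NLTK stopword list; in practice you would use
# set(stopwords.words('english')) from nltk.corpus.
stopwords = {"this", "is", "the"}

sentence_tokens = [
    ["this", "is", "the", "first", "text", "document"],
    ["this", "is", "the", "second", "text", "document"],
    ["this", "is", "the", "third", "document"],
]

# Nested comprehension: the inner loop filters words within one
# sentence, the outer loop walks the list of sentences.
sentence_tokens_no_stopwords = [
    [w for w in s if w not in stopwords] for s in sentence_tokens
]

print(sentence_tokens_no_stopwords)
# [['first', 'text', 'document'], ['second', 'text', 'document'], ['third', 'document']]
```

Wrapping `stopwords` in a `set` keeps each membership test O(1), which matters once the documents get large.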
