How to set custom stop words for sklearn CountVectorizer?

Views: 55  Published: 2022/1/2 17:52:20  Tags: python machine-learning scikit-learn nlp


Question

I'm trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.

The sklearn tutorial has a step where you count the term frequency of the words to feed into the LDA:

from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

This uses the built-in stop word list, which I believe is only available for English. How can I use my own stop word list instead?

Recommended answer

You can simply assign a frozenset of your own words to the stop_words argument, e.g.:

stop_words = frozenset(["word1", "word2", "word3"])
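To make the answer concrete, here is a minimal sketch of the whole flow on a toy corpus. The German stop words and the three documents are made up for illustration and are not from the original question. The sketch passes a plain list rather than a frozenset, because the scikit-learn documentation describes stop_words as 'english', a list, or None, and recent releases may reject other container types; if you already have a frozenset, wrapping it in list(...) should be enough.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop words and documents, for illustration only.
custom_stop_words = ["der", "die", "das", "und", "ist"]
docs = [
    "der Hund ist klein und schnell",
    "die Katze ist klein",
    "das Haus und der Garten",
]

# Pass the custom list through the stop_words argument.
# Tokens are lowercased by default, so the stop words should be lowercase too.
tf_vectorizer = CountVectorizer(stop_words=custom_stop_words)
tf = tf_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index in the count matrix.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)
print(tf.shape)
```

The resulting matrix tf can then be fed to the LDA step as in the tutorial. Other arguments from the question, such as max_df, min_df, and max_features, can be combined with a custom stop_words list in the same call.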
