How to use bigrams + trigrams + word-marks vocabulary in CountVectorizer?

Views: 296 Published: 2020/5/18 1:07:52 Tags: machine-learning nlp text-classification countvectorizer


Problem description

I'm doing text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used the following combination of features:

bigrams + trigrams + word-marks vocabulary

By word-marks, the author means words that are specific to a certain dialect.

How can I set up these features through the parameters of CountVectorizer?

Here are some examples of word-marks. They are not the ones I actually have, because mine are in Arabic, so I translated them.

word_marks=['love', 'funny', 'happy', 'amazing']

These are used to classify a text.
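One possible way to set up this combination is sketched below. It is only an illustration under stated assumptions, not necessarily what the paper did: one CountVectorizer extracts bigrams and trigrams with `ngram_range=(2, 3)`, a second one counts only the word-marks through a fixed `vocabulary`, and `make_union` concatenates the two feature matrices. The toy `texts` and `labels` are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_union

word_marks = ['love', 'funny', 'happy', 'amazing']

# Hypothetical toy data, only here to make the sketch runnable.
texts = [
    "i love this funny movie",
    "what an amazing happy day",
    "this is a plain sentence",
    "nothing special here today",
]
labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]

# Bigrams and trigrams learned from the training texts.
ngram_vec = CountVectorizer(ngram_range=(2, 3))
# Unigram counts restricted to the fixed list of word-marks.
marks_vec = CountVectorizer(vocabulary=word_marks)

# Concatenate both feature blocks column-wise (n-grams first, word-marks last).
features = make_union(ngram_vec, marks_vec)
X = features.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)

# The last len(word_marks) columns hold the word-mark counts.
mark_counts = X.toarray()[:, -len(word_marks):].tolist()
```

New texts would then go through `features.transform(...)` before being passed to `clf.predict(...)`.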

Also, in this post: Understanding the `ngram_range` argument in a CountVectorizer in sklearn

there is this answer:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

I couldn't understand the output. What does [1, 1] mean here? And how was he able to use `ngram_range` together with `vocabulary`? Aren't the two mutually exclusive?
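A small sketch may help to read that output. It assumes scikit-learn's default tokenizer and passes the vocabulary as a list so that the column order is explicit. Each column of the result is the count of one vocabulary entry in the document; `ngram_range` decides which n-grams are generated from the text, and `vocabulary` decides which of those are kept as columns, so the two parameters work together rather than excluding each other.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A list keeps the column order explicit: column 0 = "keeps", column 1 = "keeps the".
vocab = ["keeps", "keeps the"]
docs = [
    "an apple a day keeps the doctor away",
    "the doctor keeps keeps the apple",
]

# Unigrams and bigrams are generated, but only the entries in vocab are counted.
v = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)
rows = v.transform(docs).toarray().tolist()
for doc, row in zip(docs, rows):
    print(doc, "->", dict(zip(vocab, row)))

# With unigrams only, the bigram "keeps the" is never generated, so its count is 0.
v_uni = CountVectorizer(ngram_range=(1, 1), vocabulary=vocab)
rows_uni = v_uni.transform(docs).toarray().tolist()
print(rows_uni)
```

In the first document "keeps" and "keeps the" each occur once, which is the `[1, 1]` from the quoted answer.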
