How to use bigrams + trigrams + word-marks vocabulary in CountVectorizer?
Question
I'm using naive Bayes and CountVectorizer for text classification, to classify dialects. I read a research paper in which the author used a combination of:
bigrams + trigrams + word-marks vocabulary
By word-marks, he means words that are specific to a certain dialect.
How can I set those parameters in CountVectorizer?
Here are some examples of word-marks. They are not the ones I actually have, because mine are in Arabic, so I translated them.
word_marks=['love', 'funny', 'happy', 'amazing']
Those are used to classify a text.
Also, in this post, Understanding the `ngram_range` argument in a CountVectorizer in sklearn, there is this answer:
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]]) # unigram and bigram found
I couldn't understand the output. What does [1,1] mean here? And how was he able to use ngram_range together with vocabulary? Aren't the two mutually exclusive?
Recommended answer
You want to use the `ngram_range` argument to use bigrams and trigrams. In your case, it would be CountVectorizer(ngram_range=(1, 3)). Note that (1, 3) also includes unigrams; use (2, 3) if you want bigrams and trigrams only.
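As a rough illustration of what `ngram_range` selects, here is a minimal sketch. It reuses the sentence from the question as a toy corpus and assumes the default tokenizer, which drops single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: the sentence quoted in the question.
docs = ["an apple a day keeps the doctor away"]

# (2, 3) extracts bigrams and trigrams only; (1, 3) would add unigrams as well.
vec = CountVectorizer(ngram_range=(2, 3))
vec.fit(docs)

# vocabulary_ maps each extracted n-gram to its column index.
ngrams = sorted(vec.vocabulary_)
print(ngrams)
```

Every entry in `ngrams` is a two-word or three-word sequence, for example "keeps the" and "keeps the doctor"; the single word "keeps" is absent because the lower bound of the range is 2.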
See the accepted answer to this question for more details.
Please provide an example of "word-marks" for the other part of your question.
You may have to run CountVectorizer twice - once for n-grams and once for your custom word-mark vocabulary. You can then concatenate the outputs of the two CountVectorizers to get a single feature set of n-gram counts and custom vocabulary counts. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.
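The two-vectorizer approach could be sketched roughly as follows. The toy documents are made up for illustration, the word-mark list is the translated one from the question, and `scipy.sparse.hstack` is one way (an assumption here, not necessarily what the linked answer uses) to concatenate the two sparse outputs column-wise.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the word-marks are the translated examples from the question.
docs = ["I love this funny movie", "what an amazing happy day"]
word_marks = ["love", "funny", "happy", "amazing"]

# First vectorizer: bigram and trigram counts, vocabulary learned from the data.
ngram_vec = CountVectorizer(ngram_range=(2, 3))
X_ngrams = ngram_vec.fit_transform(docs)

# Second vectorizer: counts only the word-marks; columns follow the list order.
mark_vec = CountVectorizer(vocabulary=word_marks)
X_marks = mark_vec.fit_transform(docs)

# Concatenate column-wise into a single sparse feature matrix.
X = hstack([X_ngrams, X_marks]).tocsr()
print(X.shape)
print(X_marks.toarray())
```

Each column of `X_marks` is the count of one vocabulary entry in each document, which is also how the `[1, 1]` in the quoted answer reads: one occurrence of "keeps" and one of "keeps the". At prediction time, call `transform` (not `fit_transform`) on both vectorizers so that the columns match those seen during training.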
Here is an SO answer on concatenating arrays.