How to use bigrams + trigrams + word-marks vocabulary in CountVectorizer?


Problem description

I'm doing text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of:

bigrams + trigrams + word-marks vocabulary 

By word-marks, the author means the words that are specific to a certain dialect.

How can I set those parameters in CountVectorizer?

Here are some examples of word-marks. They are not the ones I actually have, because mine are in Arabic, so I translated them.

word_marks=['love', 'funny', 'happy', 'amazing']

Those are used to classify a text.

Also, in this post: Understanding the `ngram_range` argument in a CountVectorizer in sklearn

there is this answer:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

I couldn't understand the output. What does [1, 1] mean here? And how was he able to use n-grams together with a vocabulary? Aren't the two mutually exclusive?

Recommended answer

You want to use the `ngram_range` argument to get bigrams and trigrams. In your case, it would be CountVectorizer(ngram_range=(1, 3)). Note that (1, 3) also includes unigrams; (2, 3) would extract bigrams and trigrams only.
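As a minimal sketch of what `ngram_range=(1, 3)` extracts, the snippet below reuses the example sentence from the question. It assumes the default tokenizer settings, which lowercase the text and drop single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["an apple a day keeps the doctor away"]

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams in one pass.
vectorizer = CountVectorizer(ngram_range=(1, 3))
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each extracted n-gram to its column index in `counts`.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for name in features:
    print(name)
```

Each printed n-gram corresponds to one column of `counts`, so the matrix has one row per document and one column per distinct n-gram.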

See the accepted answer to this question for more details.
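On the `[1, 1]` output from the question: a fixed `vocabulary` and `ngram_range` are not mutually exclusive. The vocabulary decides which terms become columns, while `ngram_range` decides which n-grams are generated from the text and can therefore be matched against it. The sketch below passes the vocabulary as a list (an assumption made here so the column order is explicit) and compares two ranges.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["an apple a day keeps the doctor away"]

# Column 0 counts "keeps", column 1 counts "keeps the".
vocab = ["keeps", "keeps the"]

# Unigrams and bigrams are generated, so both vocabulary terms can match.
v_bi = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)
row_bi = v_bi.fit_transform(text).toarray()[0]

# Only unigrams are generated, so the bigram "keeps the" can never match.
v_uni = CountVectorizer(ngram_range=(1, 1), vocabulary=vocab)
row_uni = v_uni.fit_transform(text).toarray()[0]

for term, n_bi, n_uni in zip(vocab, row_bi, row_uni):
    print(f"{term!r}: (1, 2) -> {n_bi}, (1, 1) -> {n_uni}")
```

So `[1, 1]` simply means each of the two vocabulary terms occurs once in the sentence.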

For the other part of your question, please provide an example of "word-marks".

You may have to run CountVectorizer twice: once for the n-grams and once for your custom word-mark vocabulary. You can then concatenate the outputs of the two CountVectorizers to get a single feature set containing both the n-gram counts and the custom vocabulary counts. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.
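The two-vectorizer approach could be sketched roughly as follows. The two documents are made up for illustration, `word_marks` is the list from the question, and `ngram_range=(2, 3)` is an assumption chosen to get bigrams and trigrams only. `scipy.sparse.hstack` is used for the concatenation because CountVectorizer returns sparse matrices.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; replace with the real dialect texts.
docs = ["what a funny and amazing day", "i love this happy song"]
word_marks = ['love', 'funny', 'happy', 'amazing']

# First pass: bigram and trigram counts learned from the corpus.
ngram_vec = CountVectorizer(ngram_range=(2, 3))
X_ngrams = ngram_vec.fit_transform(docs)

# Second pass: counts restricted to the fixed word-mark vocabulary.
mark_vec = CountVectorizer(vocabulary=word_marks)
X_marks = mark_vec.fit_transform(docs)

# Stack the two sparse matrices side by side: one row per document,
# n-gram columns first, word-mark columns last.
X = hstack([X_ngrams, X_marks]).tocsr()
print(X.shape)
```

The combined matrix `X` can then be passed to a naive Bayes classifier. For unseen texts, call `transform` (not `fit_transform`) on both vectorizers so that the columns line up with the training features.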

Here is an SO answer on concatenating arrays.
