How can I prevent TfidfVectorizer from picking up numbers as vocabulary?

Question

I use TfidfVectorizer like this:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# docs maps 'train'/'test' to lists of strings; xs collects the resulting matrices
stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(docs['train'])
xs['test'] = vectorizer.transform(docs['test']).toarray()

But when inspecting vectorizer.vocabulary_, I noticed that it learns purely numeric features:

[(u'00', 0), (u'000', 1), (u'0000', 2), (u'00000', 3), (u'000000', 4)

I don't want this. How can I prevent it?

Recommended answer

You can define token_pattern when initializing the vectorizer. The default is r'(?u)\b\w\w+\b' (the (?u) part just turns on the re.UNICODE flag). Adjust the pattern until it matches only the tokens you need.

Something like:

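# Raw string, so \b is a regex word boundary rather than a backspace character.
# A token must now contain at least one ASCII letter, so pure numbers are dropped.
vectorizer = TfidfVectorizer(stop_words=stop_words,
                             min_df=200,
                             token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')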
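
To see what the two patterns accept, here is a minimal sketch that applies them with re.findall from the standard library. It assumes this approximates how the vectorizer tokenizes lowercased text. The sample sentence and the variable names are made up for illustration.

```python
import re

# Default pattern quoted in the answer, written as a raw string.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the answer: a token must contain at least one ASCII letter.
LETTER_PATTERN = r"(?u)\b\w*[a-zA-Z]\w*\b"

# Hypothetical sample text, already lowercased.
text = "we sold 000 units of a model x200 for 15,000 usd in 2019"

default_tokens = re.findall(DEFAULT_PATTERN, text)
letter_tokens = re.findall(LETTER_PATTERN, text)

print(default_tokens)
# ['we', 'sold', '000', 'units', 'of', 'model', 'x200', 'for', '15', '000', 'usd', 'in', '2019']
print(letter_tokens)
# ['we', 'sold', 'units', 'of', 'a', 'model', 'x200', 'for', 'usd', 'in']
```

Two side effects are worth noting. Mixed tokens such as x200 are kept, and single-letter tokens such as a are now accepted, whereas the default pattern requires at least two characters. A stop-word list would normally remove the latter.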

Another option (if the fact that numbers appear in your samples matters) is to mask all the numbers before vectorizing.

import re
sample = re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER-SPECIAL-TOKEN', sample)

This way, all numbers map to the same entry in your vectorizer's vocabulary, and you don't ignore them completely either.
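
The masking step could be sketched as a small helper applied to every document before it reaches the vectorizer. The helper name mask_numbers, the placeholder numbertoken, and the sample strings are assumptions for illustration. The placeholder uses letters only because, with the default token_pattern, a hyphenated placeholder would be split at the hyphens into several tokens.

```python
import re

# Number pattern from the answer, as a raw string so \b means "word boundary".
NUMBER_RE = re.compile(r"\b[0-9][0-9.,-]*\b")
# Hypothetical placeholder; letters only, so a word tokenizer keeps it in one piece.
NUMBER_TOKEN = "numbertoken"


def mask_numbers(sample):
    """Replace every number-like run in `sample` with NUMBER_TOKEN."""
    return NUMBER_RE.sub(NUMBER_TOKEN, sample)


# Hypothetical documents.
samples = [
    "price rose from 1,200 to 1,350.50",
    "call 555-0199 before 2020",
]
masked = [mask_numbers(s) for s in samples]

print(masked)
# ['price rose from numbertoken to numbertoken', 'call numbertoken before numbertoken']
```

The masked list is what you would then pass to fit_transform (and, for test data, to transform), so both splits see the same placeholder.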
