add stemming support to CountVectorizer (sklearn)


Question

I'm trying to add stemming to my NLP pipeline with sklearn.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer

stop = stopwords.words('french')
stemmer = FrenchStemmer()


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC


class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stemmer):
        super(StemmedCountVectorizer, self).__init__()
        self.stemmer = stemmer

    def build_analyzer(self):
        # Wrap the default analyzer so every token is stemmed.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (self.stemmer.stem(w) for w in analyzer(doc))


stem_vectorizer = StemmedCountVectorizer(stemmer)
text_clf = Pipeline([('vect', stem_vectorizer),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear', C=1))])

When using this pipeline with sklearn's plain CountVectorizer it works. It also works if I create the features manually, like this:

vectorizer = StemmedCountVectorizer(stemmer)
X_counts = vectorizer.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

Edit:

If I try this pipeline in my IPython Notebook, it displays the [*] and nothing happens. When I look at my terminal, it gives this error:

Process PoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 102, in worker
    task = get()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 360, in get
    return recv()
AttributeError: 'module' object has no attribute 'StemmedCountVectorizer'
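That `AttributeError` comes from joblib's worker processes trying to unpickle the estimator: pickle stores classes and functions by their `module.name` path, so a class defined only inside the notebook's `__main__` cannot be re-imported by a freshly spawned worker, and the lambda returned by `build_analyzer` above cannot be pickled at all. A minimal demonstration of the second point, independent of sklearn:

```python
import pickle

# pickle records functions by an importable "module.name"; a lambda has no
# such name, so serializing it fails before any worker ever runs.
try:
    pickle.dumps(lambda doc: doc.split())
    picklable = True
except Exception as exc:
    picklable = False
    print(type(exc).__name__)
```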

Example

Here is the complete example:

from sklearn.pipeline import Pipeline
from sklearn import grid_search  # removed in later scikit-learn; use sklearn.model_selection.GridSearchCV instead
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemming(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils', 'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
Y = [1,0,1,1,0,0]

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC())])
parameters = { 'vect__analyzer': ['word', stemming]}

gs_clf = grid_search.GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf.fit(X, Y)

If you remove stemming from the parameters it works; otherwise it doesn't.

Update:

The problem seems to be in the parallelization, because the problem disappears when I remove n_jobs=-1.
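On Windows, `multiprocessing` spawns fresh interpreters that re-import whatever the task references, so anything the pipeline pickles must live in an importable module rather than only in the notebook's `__main__`. One common workaround (the file name `stem_utils.py` is made up for illustration) is to move the stemming helper into a small module and import it from the notebook:

```python
# stem_utils.py -- hypothetical helper module saved next to the notebook
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemming(doc):
    # Plain module-level function: picklable by worker processes,
    # unlike a lambda or a class defined only inside the notebook.
    return [stemmer.stem(w) for w in analyzer(doc)]
```

In the notebook, `from stem_utils import stemming` and pass it via `'vect__analyzer': ['word', stemming]` as before; the workers can then re-import the function by name.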

Answer

You can pass a callable as analyzer to the CountVectorizer constructor to provide a custom analyzer. This appears to work for me:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
print(stem_vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

It prints:

  (0, 4)    1
  (0, 2)    1
  (0, 0)    1
  (0, 1)    1
  (0, 3)    1
[u'dan', u'la', u'march', u'ru', u'tu']
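Because `stemmed_words` is a plain module-level function, it pickles cleanly and should also survive `GridSearchCV(..., n_jobs=-1)`. A sketch plugging it back into the question's pipeline, using the question's toy data and SVC parameters:

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils',
     'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
Y = [1, 0, 1, 1, 0, 0]

# The callable analyzer replaces the StemmedCountVectorizer subclass entirely.
text_clf = Pipeline([('vect', CountVectorizer(analyzer=stemmed_words)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear', C=1))])
text_clf.fit(X, Y)
pred = text_clf.predict(['le chat est magique'])
```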
