Errors encountered in partial_fit in scikit learn

Problem description

On training with the partial_fit function in scikit-learn I get the following error without the program terminating. How is that possible, and what are the repercussions, given that the trained model behaves correctly and gives correct output? Is this something to worry about?

/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py:207: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = (np.log(self.class_count_)

I am using the following modified training function because I have to maintain a constant list of labels/classes, since partial_fit does not allow adding new classes/labels on subsequent runs. The class prior is the same in each batch of training data:

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets,classes=None, partial=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)

        if partial:
            classes=self._encoder.fit_transform(list(set(classes)))
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            self._clf.fit(X, y)

        return self

Also, on the second call to partial_fit it throws the following error, with class count = 2000 and 3592 training samples, when calling model = self.train(featureset, classes=labels, partial=partial):

self._clf.partial_fit(X, y, classes=list(set(classes)))
  File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 277, in partial_fit
    self._count(X, Y)
  File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 443, in _count
    self.feature_count_ += safe_sparse_dot(Y.T, X)
ValueError: operands could not be broadcast together with shapes (2000,11430) (2000,10728) (2000,11430) 
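
For reference, here is a minimal standalone sketch (made-up data, not my real code) that reproduces this kind of mismatch: re-fitting the vectorizer on every batch changes the feature space, so the next partial_fit sees a different number of features.

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

vec, clf = DictVectorizer(), MultinomialNB()

# First batch: the vectorizer learns 2 features.
X1 = vec.fit_transform([{'a': 1}, {'b': 1}])
clf.partial_fit(X1, [0, 1], classes=[0, 1])

# Re-fitting the vectorizer on the next batch changes the feature space
# (3 features now), so the classifier's accumulated feature_count_ no longer
# lines up and partial_fit raises a ValueError about inconsistent dimensions
# (in older scikit-learn versions, the broadcast error shown above).
X2 = vec.fit_transform([{'a': 1}, {'c': 1}, {'d': 1}])
clf.partial_fit(X2, [0, 1, 0])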

Where am I going wrong, based on the error thrown? Does it mean that I am pushing in incorrectly dimensioned data? I tried the following; I am now calling:

        X = self._vectorizer.transform(X)
        y = self._encoder.transform(y)

each time partial_fit is called. Earlier I used fit_transform for each partial_fit call. Is this correct?

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
             X = self._vectorizer.fit_transform(X)
             y = self._encoder.fit_transform(y)
             self._clf.fit(X, y)

        return self._clf

After many tries I was able to get the following code working by accounting for the first call, but I had assumed that the pickled classifier files would grow in size after each iteration; instead I am getting the same-sized pkl file for each batch, which should not be possible:

 class MySklearnClassifier(SklearnClassifier):

    def train(self, labeled_featuresets, classes=None, partial=False,firstcall=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:

           if firstcall:
                classes = self._encoder.fit_transform(np.unique(classes))
                X = self._vectorizer.fit_transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y, classes=classes)
           else:

                X = self._vectorizer.transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y)
        else:
             X = self._vectorizer.fit_transform(X)
             y = self._encoder.fit_transform(y)
             self._clf.fit(X, y)

        return self

Here is the complete code:

class postagger(ClassifierBasedTagger):
    """
    A classifier based postagger.
    """
    #MySklearnClassifier()
    def __init__(self, feature_detector=None, train=None,estimator=None,

                 classifierinstance=None, backoff=None,
                 cutoff_prob=None, verbose=True):

        if backoff is None:
            self._taggers = [self]
        else:
            self._taggers = [self] + backoff._taggers
        if estimator:
            classifier = MySklearnClassifier(estimator=estimator)
            #MySklearnClassifier.__init__(self, classifier)
        elif classifierinstance:
            classifier=classifierinstance






        if feature_detector is not None:
            self._feature_detector = feature_detector
            # The feature detector function, used to generate a featureset
            # or each token: feature_detector(tokens, index, history) -> featureset

        self._cutoff_prob = cutoff_prob
        """Cutoff probability for tagging -- if the probability of the
           most likely tag is less than this, then use backoff."""

        self._classifier = classifier
        """The classifier used to choose a tag for each token."""

        # if train and picklename:
        #     self._train(classifier_builder, picklename,tagged_corpus=train, ONLYERRORS=False,verbose=True,onlyfeatures=True ,LOADCONSTRUCTED=None)

    def legacy_getfeatures(self, tagged_corpus=None, ONLYERRORS=False, existingfeaturesetfile=None, verbose=True,
                           labels=artlabels):

        featureset = []
        labels=artlabels
        if not existingfeaturesetfile and tagged_corpus:
            if ONLYERRORS:

                classifier_corpus = open(tagged_corpus + '-ONLYERRORS.richfeature', 'w')
            else:
                classifier_corpus = open(tagged_corpus + '.richfeature', 'w')

            if verbose:
                print('Constructing featureset  for training corpus for classifier.')
            nlp = English()
            #df=pandas.DataFrame()
            store = HDFStore('featurestore.h5')



            for sentence in sPickle.s_load(open(tagged_corpus,'r')):
                untagged_words, tags, senindex = zip(*sentence)
                doc = nlp(u' '.join(untagged_words))
                # untagged_sentence, tags , rest = unpack_three(*zip(*sentence))
                for index in range(len(sentence)):
                    if ONLYERRORS:
                        if tags[index] == '<!SAME!>' and random.random() < 0.05:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label']=tags[index]
                            featureset['senindex']=str(senindex[0])
                            featureset['wordindex']=index
                            df=pandas.DataFrame([featureset])
                            store.append('df',df,index=False,min_itemsize = 150)
                            # classifier_corpus.append((featureset, tags[index]))
                        elif tags[index] in labels:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label']=tags[index]
                            featureset['senindex']=str(senindex[0])
                            featureset['wordindex']=index
                            df=pandas.DataFrame([featureset])
                            store.append('df',df,index=False,min_itemsize = 150)


                        # classifier_corpus.append((featureset, tags[index]))
        # else:
        #     for element in sPickle.s_load(open(existingfeaturesetfile, 'w')):
        #         featureset.append( element)

        return tagged_corpus + '.richfeature'

    def _train(self, featuresetdata, classifier_builder=MultinomialNB(), partial=False, batchsize=500):
        """
        Build a new classifier, based on the given training data
        *tagged_corpus*.

        """



        #labels = set(cPickle.load(open(arguments['-k'], 'r')))
        if partial==False:
           print('Training classifier FULLMODE')
           featureset = []
           for element in sPickle.s_load(open(featuresetdata, 'r')):
               featureset.append(element)

           model = self._classifier.train(featureset, classes=artlabels, partial=False,firstcall=True)
           print('Training complete, dumping')
           try:
            joblib.dump(model,  str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.mpkl')
            print "joblib dumped"
           except:
               print "joblib error"
           cPickle.dump(model, open(str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.cmpkl', 'w'))
           print('dumped')
           return
        #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)

        print('Training classifier each batch of {} training points'.format(batchsize))

        for i, batchelement in enumerate(batch(sPickle.s_load(open(featuresetdata, 'r')), batchsize)):
            featureset = []
            for element in batchelement:
                featureset.append(element)



            # model =  super(postagger, self).train (featureset, partial)
            # pdb.set_trace()
            # featureset = [item for sublist in featureset for item in sublist]
            trainsize = len(featureset)
            print("submitting {} training points for training\neg last one:".format(trainsize))
            for d, l in featureset:
                if len(d) != 113:
                    print d
                    assert False

            print featureset[-1]
            # pdb.set_trace()
            try:
                if i==0:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=True)
                else:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=False)

            except:
                type, value, tb = sys.exc_info()
                traceback.print_exc()
                pdb.post_mortem(tb)

            print('Training for batch {} complete, dumping'.format(i))
            cPickle.dump(model, open(
                str(featuresetdata) + '-' + slugify(str(classifier_builder))[
                                            :10] + 'UPDATED batch-{} of {} points.mpkl'.format(
                    i, trainsize), 'w'))
            print('dumped')
        #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)

    def untag(self,tagged_sentence):
        """
        Given a tagged sentence, return an untagged version of that
        sentence.  I.e., return a list containing the first element
        of each tuple in *tagged_sentence*.

            >>> from nltk.tag.util import untag
            >>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
            ['John', 'saw', 'Mary']

        """

        return [w[0] for w in tagged_sentence]

    def evaluate(self, gold):
        """
        Score the accuracy of the tagger against the gold standard.
        Strip the tags from the gold standard text, retag it using
        the tagger, then compute the accuracy score.

        :type gold: list(list(tuple(str, str)))
        :param gold: The list of tagged sentences to score the tagger on.
        :rtype: float
        """
        gold_tokens=[]
        full_gold_tokens=[]

        tagged_sents = self.tag_sents(self.untag(sent) for sent in gold)
        for sentence in gold:#flatten the list

            untagged_sentences, goldtags,type_feature,startpos_feature,sentence_feature,senindex_feature = zip(*sentence)


            gold_tokens.extend(zip(untagged_sentences,goldtags))
            full_gold_tokens.extend(zip( untagged_sentences, goldtags,type_feature,startpos_feature,sentence_feature,senindex_feature))





        test_tokens = sum(tagged_sents, []) #flatten the list
        getmismatch(gold_tokens,test_tokens,full_gold_tokens)
        return accuracy(gold_tokens, test_tokens)

    #
    def new_feature_detector(self, tokens, index):
        return getfeatures(tokens, index)


    def tag_sents(self, sentences):
        """
        Apply ``self.tag()`` to each element of *sentences*.  I.e.:

            return [self.tag(sent) for sent in sentences]
        """
        return [self.tag(sent) for sent in sentences]

    def tag(self, tokens):
        # docs inherited from TaggerI
        tags = []
        for i in range(len(tokens)):
            tags.append(self.tag_one(tokens, i))
        return list(zip(tokens, tags))

    def tag_one(self, tokens, index):
        """
        Determine an appropriate tag for the specified token, and
        return that tag.  If this tagger is unable to determine a tag
        for the specified token, then its backoff tagger is consulted.

        :rtype: str
        :type tokens: list
        :param tokens: The list of words that are being tagged.
        :type index: int
        :param index: The index of the word whose tag should be
            returned.
        :type history: list(str)
        :param history: A list of the tags for all words before *index*.
        """
        tag = None
        for tagger in self._taggers:
            tag = tagger.choose_tag(tokens, index)
            if tag is not None:  break
        return tag

    def choose_tag(self, tokens, index):
        # Use our feature detector to get the featureset.
        featureset = self.new_feature_detector(tokens, index)

        # Use the classifier to pick a tag.  If a cutoff probability
        # was specified, then check that the tag's probability is
        # higher than that cutoff first; otherwise, return None.

        if self._cutoff_prob is None:
            return self._classifier.prob_classify_many([featureset])
            #return self._classifier.classify_many([featureset])


        pdist = self._classifier.prob_classify_many([featureset])
        tag = pdist.max()
        return tag if pdist.prob(tag) >= self._cutoff_prob else None

Answer

1. The RuntimeWarning

You're getting this warning because np.log is called on 0:

In [6]: np.log(0)
/home/anaconda/envs/python34/lib/python3.4/site-packages/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in log
  if __name__ == '__main__':
Out[6]: -inf

That's because in one of your calls, some classes are not represented at all (they have a count of 0), so np.log is called on 0. You don't need to worry about it.
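
A minimal sketch (made-up data) of where the -inf comes from: if a class listed in classes has not yet appeared in the data, its entry in class_count_ is 0, and np.log(0) produces -inf along with that RuntimeWarning.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1, 0, 2],
              [0, 3, 1]])
y = np.array([0, 1])

clf = MultinomialNB()
# Class 2 is declared but absent from this batch, so its count stays 0.
clf.partial_fit(X, y, classes=[0, 1, 2])

print(clf.class_count_)      # [1. 1. 0.]
print(clf.class_log_prior_)  # last entry is -inf -> the divide-by-zero warning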

I am using the following modified training function as I have to maintain a constant list of labels/classes, since partial_fit does not allow adding new classes/labels on subsequent runs; the class prior is the same in each batch of training data

• You are right that you need to pass the list of labels/classes from the start if you're using partial_fit.
• I'm unsure about the class prior being the same in each batch of training data. That could mean several different things, so it would be good if you could clarify what you meant here.
  In the meantime, the default behavior of classifiers such as MultinomialNB is to fit the priors to the data (basically they compute class frequencies). When using partial_fit, they do this computation incrementally, so you get the same result as if you had used a single fit call; see the sketch below.
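
To illustrate that with a small made-up example: splitting the same data across two partial_fit calls ends up with the same fitted priors as one fit call.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 20))
y = rng.randint(0, 3, size=100)

full = MultinomialNB().fit(X, y)                      # one-shot fit

inc = MultinomialNB()                                 # incremental fit
inc.partial_fit(X[:50], y[:50], classes=np.unique(y))
inc.partial_fit(X[50:], y[50:])

# Counts are accumulated across batches, so the priors agree.
print(np.allclose(full.class_log_prior_, inc.class_log_prior_))  # True
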
Also on the second call to partial_fit it throws the following error for class count = 2000, with 3592 training samples, on calling model = self.train(featureset, classes=labels, partial=partial)

Here we need more details. Note that the shapes in the traceback are (n_classes, n_features): feature_count_ is (2000, 11430), while the counts computed from the new batch are (2000, 10728), so the number of features has changed between calls (11430 vs. 10728).

The error indeed means that the dimensions of your inputs are inconsistent. I would suggest printing X.shape, y.shape after vectorization for each partial_fit call.

Also, you should not be calling fit or fit_transform on the vectorizer that transforms X for each partial_fit call: you should fit it once, then just transform X. This is to ensure that you get consistent dimensions for your transformed X.
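
For instance, a rough sketch of that pattern with NLTK-style (featureset, label) pairs; I'm assuming a DictVectorizer/LabelEncoder pair like the ones SklearnClassifier holds internally, and the batches are made up:

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB

batch1 = [({'w': 'cat', 'len': 3}, 'NOUN'), ({'w': 'runs', 'len': 4}, 'VERB')]
batch2 = [({'w': 'dog', 'len': 3}, 'NOUN'), ({'w': 'eats', 'len': 4}, 'VERB')]

vec, enc, clf = DictVectorizer(), LabelEncoder(), MultinomialNB()
all_labels = ['NOUN', 'VERB']        # the fixed, full label set
enc.fit(all_labels)

# Fit the vectorizer ONCE, ideally on data that covers the whole feature
# vocabulary, then only transform on later batches.
X1 = vec.fit_transform([f for f, _ in batch1])
y1 = enc.transform([l for _, l in batch1])
print(X1.shape, y1.shape)
clf.partial_fit(X1, y1, classes=enc.transform(all_labels))

# transform only: same number of columns; features unseen during the initial
# fit are silently dropped, which is why that fit should be representative.
X2 = vec.transform([f for f, _ in batch2])
y2 = enc.transform([l for _, l in batch2])
print(X2.shape, y2.shape)
clf.partial_fit(X2, y2)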

Here's the code you told us you were using:

      class MySklearnClassifier(SklearnClassifier):
          def train(self, labeled_featuresets, classes=None, partial=False):
              """
              Train (fit) the scikit-learn estimator.
      
              :param labeled_featuresets: A list of ``(featureset, label)``
                  where each ``featureset`` is a dict mapping strings to either
                  numbers, booleans or strings.
              """
      
              X, y = list(compat.izip(*labeled_featuresets))
      
              if partial:
                  classes = self._encoder.fit_transform(np.unique(classes))
                  X = self._vectorizer.transform(X)
                  y = self._encoder.transform(y)
                  self._clf.partial_fit(X, y, classes=list(set(classes)))
              else:
                   X = self._vectorizer.fit_transform(X)
                   y = self._encoder.fit_transform(y)
                   self._clf.fit(X, y)
      
              return self._clf
      

As far as I can tell there's not much wrong with that, but we really need more context as to how you're using it here.
A nitpick: I feel it would be clearer to make the classes variable a class attribute, since it needs to be the same for every partial_fit call; a rough sketch follows below.
You might be doing something wrong here if you pass different values to the classes argument between calls.
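
Something along these lines, as a sketch only (it assumes you can pass the full label list when constructing the wrapper; the class and method layout here is illustrative, apart from the _vectorizer/_encoder/_clf attributes your code already uses):

import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier

class MyPartialSklearnClassifier(SklearnClassifier):
    def __init__(self, estimator, classes):
        SklearnClassifier.__init__(self, estimator)
        # Encode the fixed label set once; every partial_fit call reuses it.
        self._classes = self._encoder.fit_transform(np.unique(classes))
        self._first_call = True

    def train(self, labeled_featuresets):
        X, y = zip(*labeled_featuresets)
        if self._first_call:
            X = self._vectorizer.fit_transform(X)   # fit the vectorizer once
            self._first_call = False
        else:
            X = self._vectorizer.transform(X)       # transform only afterwards
        y = self._encoder.transform(y)
        self._clf.partial_fit(X, y, classes=self._classes)
        return self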

More information that could help us help you:

• Prints of X.shape and y.shape.
• Context: how are you using the code you provided?
• What are you using for _vectorizer and _encoder? Which classifier are you ultimately working with?
