Errors encountered in partial_fit in scikit-learn
Question
While training with partial_fit in scikit-learn, I get the following warning, yet the program does not terminate. How is that possible, and what are the repercussions, given that the trained model behaves correctly and gives correct output? Is this something to worry about?
/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py:207: RuntimeWarning: divide by zero encountered in log
self.class_log_prior_ = (np.log(self.class_count_)
I am using the following modified training function because I have to maintain a constant list of labels/classes: partial_fit does not allow adding new classes/labels on subsequent runs. The class prior is the same in each batch of training data:
class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets,classes=None, partial=True):
        """
        Train (fit) the scikit-learn estimator.
        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """
        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)
        if partial:
            classes=self._encoder.fit_transform(list(set(classes)))
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            self._clf.fit(X, y)
        return self
Also, the second call to partial_fit throws the following error. The class count is 2000 and there are 3592 training samples; the call is model = self.train(featureset, classes=labels, partial=partial):
self._clf.partial_fit(X, y, classes=list(set(classes)))
File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 277, in partial_fit
self._count(X, Y)
File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 443, in _count
self.feature_count_ += safe_sparse_dot(Y.T, X)
ValueError: operands could not be broadcast together with shapes (2000,11430) (2000,10728) (2000,11430)
Where am I going wrong, based on the error thrown? Does it mean that I am pushing in data with incorrect dimensions? I tried the following; I am now calling:
X = self._vectorizer.transform(X)
y = self._encoder.transform(y)
each time partial_fit is called. Earlier I used fit_transform for each partial_fit call. Is this correct?
class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.
        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """
        X, y = list(compat.izip(*labeled_featuresets))
        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)
        return self._clf
After many tries I got the following code working by accounting for the first call. I had assumed that the pickled classifier files would grow after each iteration, but I am getting a pkl file of the same size for each batch, which I did not think was possible:
class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False,firstcall=True):
        """
        Train (fit) the scikit-learn estimator.
        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """
        X, y = list(compat.izip(*labeled_featuresets))
        if partial:
            if firstcall:
                classes = self._encoder.fit_transform(np.unique(classes))
                X = self._vectorizer.fit_transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y, classes=classes)
            else:
                X = self._vectorizer.transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y)
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)
        return self
Here is the complete code:
class postagger(ClassifierBasedTagger):
    """
    A classifier based postagger.
    """
    #MySklearnClassifier()
    def __init__(self, feature_detector=None, train=None,estimator=None,
                 classifierinstance=None, backoff=None,
                 cutoff_prob=None, verbose=True):
        if backoff is None:
            self._taggers = [self]
        else:
            self._taggers = [self] + backoff._taggers
        if estimator:
            classifier = MySklearnClassifier(estimator=estimator)
            #MySklearnClassifier.__init__(self, classifier)
        elif classifierinstance:
            classifier=classifierinstance
        if feature_detector is not None:
            self._feature_detector = feature_detector
            # The feature detector function, used to generate a featureset
            # or each token: feature_detector(tokens, index, history) -> featureset
        self._cutoff_prob = cutoff_prob
        """Cutoff probability for tagging -- if the probability of the
        most likely tag is less than this, then use backoff."""
        self._classifier = classifier
        """The classifier used to choose a tag for each token."""
        # if train and picklename:
        #     self._train(classifier_builder, picklename,tagged_corpus=train, ONLYERRORS=False,verbose=True,onlyfeatures=True ,LOADCONSTRUCTED=None)
    def legacy_getfeatures(self, tagged_corpus=None, ONLYERRORS=False, existingfeaturesetfile=None, verbose=True,
                           labels=artlabels):
        featureset = []
        labels=artlabels
        if not existingfeaturesetfile and tagged_corpus:
            if ONLYERRORS:
                classifier_corpus = open(tagged_corpus + '-ONLYERRORS.richfeature', 'w')
            else:
                classifier_corpus = open(tagged_corpus + '.richfeature', 'w')
            if verbose:
                print('Constructing featureset for training corpus for classifier.')
            nlp = English()
            #df=pandas.DataFrame()
            store = HDFStore('featurestore.h5')
            for sentence in sPickle.s_load(open(tagged_corpus,'r')):
                untagged_words, tags, senindex = zip(*sentence)
                doc = nlp(u' '.join(untagged_words))
                # untagged_sentence, tags , rest = unpack_three(*zip(*sentence))
                for index in range(len(sentence)):
                    if ONLYERRORS:
                        if tags[index] == '<!SAME!>' and random.random() < 0.05:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label']=tags[index]
                            featureset['senindex']=str(senindex[0])
                            featureset['wordindex']=index
                            df=pandas.DataFrame([featureset])
                            store.append('df',df,index=False,min_itemsize = 150)
                            # classifier_corpus.append((featureset, tags[index]))
                    elif tags[index] in labels:
                        featureset = self.new_feature_detector(doc, index)
                        sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                        featureset['label']=tags[index]
                        featureset['senindex']=str(senindex[0])
                        featureset['wordindex']=index
                        df=pandas.DataFrame([featureset])
                        store.append('df',df,index=False,min_itemsize = 150)
                        # classifier_corpus.append((featureset, tags[index]))
        # else:
        #     for element in sPickle.s_load(open(existingfeaturesetfile, 'w')):
        #         featureset.append( element)
        return tagged_corpus + '.richfeature'
    def _train(self, featuresetdata, classifier_builder=MultinomialNB(), partial=False, batchsize=500):
        """
        Build a new classifier, based on the given training data
        *tagged_corpus*.
        """
        #labels = set(cPickle.load(open(arguments['-k'], 'r')))
        if partial==False:
            print('Training classifier FULLMODE')
            featureset = []
            for element in sPickle.s_load(open(featuresetdata, 'r')):
                featureset.append(element)
            model = self._classifier.train(featureset, classes=artlabels, partial=False,firstcall=True)
            print('Training complete, dumping')
            try:
                joblib.dump(model, str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.mpkl')
                print "joblib dumped"
            except:
                print "joblib error"
                cPickle.dump(model, open(str(featuresetdata) + '-FULLTRAIN ' + slugify(str(classifier_builder))[:10] +'.cmpkl', 'w'))
            print('dumped')
            return
        #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)
        print('Training classifier each batch of {} training points'.format(batchsize))
        for i, batchelement in enumerate(batch(sPickle.s_load(open(featuresetdata, 'r')), batchsize)):
            featureset = []
            for element in batchelement:
                featureset.append(element)
            # model = super(postagger, self).train (featureset, partial)
            # pdb.set_trace()
            # featureset = [item for sublist in featureset for item in sublist]
            trainsize = len(featureset)
            print("submitting {} training points for training\neg last one:".format(trainsize))
            for d, l in featureset:
                if len(d) != 113:
                    print d
                    assert False
            print featureset[-1]
            # pdb.set_trace()
            try:
                if i==0:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=True)
                else:
                    model = self._classifier.train(featureset, classes=artlabels, partial=True,firstcall=False)
            except:
                type, value, tb = sys.exc_info()
                traceback.print_exc()
                pdb.post_mortem(tb)
            print('Training for batch {} complete, dumping'.format(i))
            cPickle.dump(model, open(
                str(featuresetdata) + '-' + slugify(str(classifier_builder))[
                                            :10] + 'UPDATED batch-{} of {} points.mpkl'.format(
                    i, trainsize), 'w'))
            print('dumped')
            #joblib.dump(self._classifier,str(datetime.datetime.now().hour)+'-naivebayes.pickle',compress=0)
    def untag(self,tagged_sentence):
        """
        Given a tagged sentence, return an untagged version of that
        sentence. I.e., return a list containing the first element
        of each tuple in *tagged_sentence*.
        >>> from nltk.tag.util import untag
        >>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
        ['John', 'saw', 'Mary']
        """
        return [w[0] for w in tagged_sentence]
    def evaluate(self, gold):
        """
        Score the accuracy of the tagger against the gold standard.
        Strip the tags from the gold standard text, retag it using
        the tagger, then compute the accuracy score.
        :type gold: list(list(tuple(str, str)))
        :param gold: The list of tagged sentences to score the tagger on.
        :rtype: float
        """
        gold_tokens=[]
        full_gold_tokens=[]
        tagged_sents = self.tag_sents(self.untag(sent) for sent in gold)
        for sentence in gold:#flatten the list
            untagged_sentences, goldtags,type_feature,startpos_feature,sentence_feature,senindex_feature = zip(*sentence)
            gold_tokens.extend(zip(untagged_sentences,goldtags))
            full_gold_tokens.extend(zip( untagged_sentences, goldtags,type_feature,startpos_feature,sentence_feature,senindex_feature))
        test_tokens = sum(tagged_sents, []) #flatten the list
        getmismatch(gold_tokens,test_tokens,full_gold_tokens)
        return accuracy(gold_tokens, test_tokens)
    #
    def new_feature_detector(self, tokens, index):
        return getfeatures(tokens, index)
    def tag_sents(self, sentences):
        """
        Apply ``self.tag()`` to each element of *sentences*. I.e.:
            return [self.tag(sent) for sent in sentences]
        """
        return [self.tag(sent) for sent in sentences]
    def tag(self, tokens):
        # docs inherited from TaggerI
        tags = []
        for i in range(len(tokens)):
            tags.append(self.tag_one(tokens, i))
        return list(zip(tokens, tags))
    def tag_one(self, tokens, index):
        """
        Determine an appropriate tag for the specified token, and
        return that tag. If this tagger is unable to determine a tag
        for the specified token, then its backoff tagger is consulted.
        :rtype: str
        :type tokens: list
        :param tokens: The list of words that are being tagged.
        :type index: int
        :param index: The index of the word whose tag should be
            returned.
        :type history: list(str)
        :param history: A list of the tags for all words before *index*.
        """
        tag = None
        for tagger in self._taggers:
            tag = tagger.choose_tag(tokens, index)
            if tag is not None: break
        return tag
    def choose_tag(self, tokens, index):
        # Use our feature detector to get the featureset.
        featureset = self.new_feature_detector(tokens, index)
        # Use the classifier to pick a tag. If a cutoff probability
        # was specified, then check that the tag's probability is
        # higher than that cutoff first; otherwise, return None.
        if self._cutoff_prob is None:
            return self._classifier.prob_classify_many([featureset])
            #return self._classifier.classify_many([featureset])
        pdist = self._classifier.prob_classify_many([featureset])
        tag = pdist.max()
        return tag if pdist.prob(tag) >= self._cutoff_prob else None
Answer
1. The RuntimeWarning
You're getting this warning because np.log is called on 0:
In [6]: np.log(0)
/home/anaconda/envs/python34/lib/python3.4/site-packages/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in log
if __name__ == '__main__':
Out[6]: -inf
That's because in one of your calls some classes are not represented at all (they have a count of 0), so np.log is called on 0. NumPy reports this as a warning rather than raising an exception, which is why the program keeps running. You don't need to worry about it.
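To make the warning concrete, here is a minimal sketch that reproduces it with NumPy alone. The array class_count is made-up data standing in for the classifier's per-class counts, and whether a given scikit-learn release lets the warning surface is version-dependent, so treat this as an illustration rather than a description of the library's internals.

```python
import warnings

import numpy as np

# Hypothetical per-class counts: the middle class never appeared in the data.
class_count = np.array([3.0, 0.0, 5.0])

# np.log(0) emits a RuntimeWarning (not an exception) and returns -inf.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    log_count = np.log(class_count)

print(log_count)
print([str(w.message) for w in caught])

# If the -inf entries are acceptable, the warning can be silenced locally.
with np.errstate(divide="ignore"):
    quiet = np.log(class_count)
```

A class whose log count is -inf simply has zero prior probability until a later batch contains an example of it.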
You wrote: "I am using the following modified training function because I have to maintain a constant list of labels/classes: partial_fit does not allow adding new classes/labels on subsequent runs. The class prior is the same in each batch of training data."
- You are right that you need to pass the list of labels/classes from the start if you're using partial_fit.
- I'm unsure what you mean by the class prior being the same in each batch of training data. That could have several different meanings; it would be nice if you could clarify what you meant here.
In the meantime, the default behavior for classifiers such as MultinomialNB is that they fit priors to the data (basically, they compute class frequencies). When using partial_fit, they do this computation incrementally, so that you get the same result as if you had used a single fit call.
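The equivalence between incremental and single-shot fitting can be checked with a small sketch. The data here (rng, X, y) is random and purely illustrative; the only assumption is that every class is listed in the classes argument of the first partial_fit call.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count data: 60 samples, 8 features, 3 classes.
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(60, 8))
y = np.arange(60) % 3

# Single call to fit on all the data.
full = MultinomialNB().fit(X, y)

# Two incremental calls; the full class list is passed on the first call only.
inc = MultinomialNB()
inc.partial_fit(X[:30], y[:30], classes=np.array([0, 1, 2]))
inc.partial_fit(X[30:], y[30:])

same_prior = np.allclose(full.class_log_prior_, inc.class_log_prior_)
same_features = np.allclose(full.feature_log_prob_, inc.feature_log_prob_)
print(same_prior, same_features)
```

Both comparisons should come out equal, because the model only accumulates class and feature counts and recomputes the log probabilities from the running totals.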
You wrote: "Also, the second call to partial_fit throws the following error. The class count is 2000 and there are 3592 training samples; the call is model = self.train(featureset, classes=labels, partial=partial)."
Here we need more details. X should be of shape (n_samples, n_features). Note that the shapes in the traceback belong to feature_count_ and to the product of Y.T and X, both of which have shape (n_classes, n_features): 2000 is your class count, and the number of features changed from 11430 on the first call to 10728 on the second.
The error does mean that the dimensions of your inputs are inconsistent. I would suggest printing X.shape and y.shape after vectorization for each partial_fit call.
Also, you should not call fit or fit_transform on the vectorizer that transforms X for each partial_fit call: fit it once, then only transform X. This ensures that the transformed X has consistent dimensions.
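The difference between refitting and fitting once can be sketched as follows. This assumes the vectorizer is a DictVectorizer, which is not confirmed by the question, and batch_1 and batch_2 are made-up featuresets.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical batches of featuresets with different feature values.
batch_1 = [{"word": "cat", "suffix": "at"}, {"word": "dog", "suffix": "og"}]
batch_2 = [{"word": "bird", "suffix": "rd"}]

# Refitting on every batch: the number of columns follows each batch's vocabulary.
refit_shapes = [DictVectorizer().fit_transform(b).shape for b in (batch_1, batch_2)]
print(refit_shapes)

# Fitting once, then only transforming: the number of columns stays fixed.
vectorizer = DictVectorizer()
vectorizer.fit(batch_1)
fixed_shapes = [vectorizer.transform(b).shape for b in (batch_1, batch_2)]
print(fixed_shapes)
```

One caveat with this approach: DictVectorizer.transform silently ignores features it did not see during fit, so the data used for the single fit should cover the vocabulary you expect in later batches.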
Here's the code you told us you were using:
class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.
        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """
        X, y = list(compat.izip(*labeled_featuresets))
        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)
        return self._clf
As far as I can tell there's not much wrong with that, but we really need more context on how you're using it.
A nitpick: I feel it would be clearer to store the classes variable as a class attribute, since it needs to be the same for each partial_fit call.
You might be doing something wrong here if you pass different values for the classes argument on different calls.
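To illustrate why the class list has to stay fixed, here is a small sketch that assumes the encoder is a LabelEncoder (not confirmed by the question). The tag names in all_labels and batch_labels are invented for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical full label set and one batch that contains only part of it.
all_labels = ["DT", "NN", "VB"]
batch_labels = ["VB", "NN"]

# Fit the encoder once on the full label set and reuse it for every batch.
encoder = LabelEncoder().fit(all_labels)
classes = encoder.transform(all_labels)
consistent = encoder.transform(batch_labels)

# Refitting on the batch alone assigns different integers to the same labels.
refit = LabelEncoder().fit_transform(batch_labels)

print(list(classes), list(consistent), list(refit))
```

With the encoder fitted once, "VB" always maps to the same integer, so the y passed to each partial_fit call agrees with the classes array given on the first call.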
More information that could help us help you:
- Printouts of X.shape and y.shape.
- Context: how are you using the code you provided?
- What are you using for _vectorizer and _encoder? What classifier are you ultimately working with?