Incremental Learning in Scikit with PassiveAggressiveClassifier's partial_fit

Question

I'm trying to train a PassiveAggressiveClassifier on TfidfVectorizer features, using partial_fit, in the script below:

Updated code:

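import csv

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

a, ta = [], []  # training texts, test texts (ta is not populated in this snippet)
r, tr = [], []  # training labels, test labels (tr is not populated in this snippet)
g = []          # every label read so far, used to derive the class list

vect = HashingVectorizer(ngram_range=(1, 4))
model = PassiveAggressiveClassifier()
with open('files') as f:
    for line in f:
        line = line.strip()
        # First pass over the file: collect the labels for the `classes` argument.
        with open('gau-' + line + '.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                res = row['gau']
                g.append(res)

        cls = np.unique(g)
        print(len(cls))

        # Second pass: train on batches of 400 rows.
        with open('gau-' + line + '.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            i = 0
            j = True
            for row in reader:
                arr = row['text']
                res = row['gau']
                a.append(arr)
                if len(res) > 0:
                    r.append(int(res))
                i = i + 1

                if i % 400 == 0:
                    # HashingVectorizer is stateless, so transform needs no prior fit.
                    training_set = vect.transform(a)
                    print(training_set.shape)
                    training_result = np.array(r)
                    model = model.partial_fit(
                        training_set, training_result, classes=cls)
                    a, r, i = [], [], 0

        print(model)
        testing_set = vect.transform(ta)
        testing_result = np.array(tr)
        predicted = model.predict(testing_set)

        print("Result to be predicted:", testing_result)
        print("Prediction:", predicted)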

There are multiple CSV files, each containing 4k-5k records, and I am trying to fit 400 records at a time using the partial_fit function. When I ran this code, I ran into the following error:

Result to be predicted: 1742
Prediction: 2617

How do I resolve this issue? The records in my CSV files are of variable length.

Update:

After replacing TfidfVectorizer with HashingVectorizer, I successfully created my model, but when I ran prediction on my test data, the predictions were all incorrect. My training data consists of CSV files with millions of lines, and each line contains at most 4k-5k words of text.

So is there a problem with my approach? That is, can these algorithms be used with my data?

Recommended answer

This is what I understand from your problem.

1) You need to use partial_fit so that the model can be trained online.

2) Your feature space is very large.
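The two requirements above can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the toy texts, the labels, the `iter_batches` helper, the batch size, and the `n_features` value are all assumptions made for the example. The vectorizer is stateless, so every batch is mapped into the same fixed-width feature space, and the full set of classes is passed to partial_fit up front.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier


def iter_batches(texts, labels, batch_size):
    """Yield consecutive (texts, labels) slices of at most batch_size items.

    Hypothetical helper; in practice the batches would be read from disk.
    """
    for start in range(0, len(texts), batch_size):
        stop = start + batch_size
        yield texts[start:stop], labels[start:stop]


# Toy data standing in for the CSV rows (assumption for illustration).
texts = [
    "cheap pills buy now",
    "meeting at noon tomorrow",
    "buy cheap watches now",
    "lunch meeting moved to friday",
]
labels = [1, 0, 1, 0]

# Every class must be known before the first partial_fit call.
classes = np.array([0, 1])

# Stateless vectorizer: the output width is fixed by n_features, not by the data.
vect = HashingVectorizer(n_features=2**18, ngram_range=(1, 2))
model = PassiveAggressiveClassifier(random_state=0)

for batch_texts, batch_labels in iter_batches(texts, labels, batch_size=2):
    X = vect.transform(batch_texts)  # no fit step is needed
    model.partial_fit(X, np.array(batch_labels), classes=classes)

# Held-out text goes through the same transform before prediction.
predicted = model.predict(vect.transform(["buy cheap pills"]))
print(predicted)
```

Keeping texts and labels paired in every batch matters here: if a row with a missing label is skipped on the label side only, the two arrays no longer line up.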

If I got that right, then I have faced the same problem. Also, if you use HashingVectorizer, there is a high chance of key collisions.

From the HashingVectorizer documentation:

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary): there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model; there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems); there is no IDF weighting, as this would render the transformer stateful.
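The collision behaviour described in that passage can be made concrete with a small sketch. The document below is synthetic (50 made-up tokens), and the deliberately tiny `n_features=4` is an assumption chosen only to force collisions; `alternate_sign=False` and `norm=None` are set so the hashed values stay plain counts.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# One synthetic document containing 50 distinct tokens (illustrative data).
docs = [" ".join(f"token{i}" for i in range(50))]

# alternate_sign=False and norm=None keep the output as raw token counts.
small = HashingVectorizer(n_features=4, alternate_sign=False, norm=None)
large = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)

# No fit is required: the token-to-column mapping is a fixed hash function.
X_small = small.transform(docs)
X_large = large.transform(docs)

# With only 4 columns, 50 tokens must share columns (collisions).
print(X_small.shape, X_small.nnz)
# With 2**18 columns, the tokens are spread over far more indices.
print(X_large.shape, X_large.nnz)
```

In the small case the 50 tokens are squeezed into at most 4 columns, so the model cannot tell most of them apart; with the larger table, collisions become rare.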

If keys collide, accuracy may drop.

In my own online training, I first trained the classifier with partial_fit like this:

classifier = MultinomialNB(alpha=alpha_optimized).partial_fit(
    X_train_tfidf, y_train, classes=np.array([0, 1]))

On the second day, I loaded the pickled classifier, count_vect, and tfidf from the first day's training. Then I applied only transform (not fit) with count_vect and tfidf, and it worked:

X_train_counts = count_vect.transform(x_train)
X_train_tfidf = tfidf.transform(X_train_counts)
pf_classifier.partial_fit(X_train_tfidf,y_train)
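A self-contained sketch of that two-day workflow is shown below. The toy texts and labels are assumptions, and the objects are pickled to an in-memory bytes object rather than to files just to keep the example short. The key point is that the vectorizer and the TF-IDF transformer are fitted once, on day one, and only transform is called afterwards, so the feature dimension stays the same across days.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Day 1: fit the vectorizer and the TF-IDF transformer, then start training.
day1_texts = [
    "cheap pills buy now",
    "meeting at noon tomorrow",
    "buy cheap watches now",
    "lunch meeting moved to friday",
]
day1_labels = np.array([1, 0, 1, 0])

count_vect = CountVectorizer()
tfidf = TfidfTransformer()
X_day1 = tfidf.fit_transform(count_vect.fit_transform(day1_texts))

classifier = MultinomialNB().partial_fit(
    X_day1, day1_labels, classes=np.array([0, 1]))

# Persist all three objects together (in practice, written to disk).
blob = pickle.dumps((count_vect, tfidf, classifier))

# Day 2: reload, call transform only (no fit), and continue training.
count_vect2, tfidf2, pf_classifier = pickle.loads(blob)

day2_texts = ["buy pills now", "friday meeting at noon"]
day2_labels = np.array([1, 0])

X_day2 = tfidf2.transform(count_vect2.transform(day2_texts))
pf_classifier.partial_fit(X_day2, day2_labels)

print(X_day1.shape, X_day2.shape)
```

One consequence of this design is that words first seen on day two are not in the day-one vocabulary, so transform ignores them.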

If anything is unclear, please reply.
