Working with, preparing bag-of-word data for Regression

Problem description

I'm trying to create a regression model that predicts an author's age. I'm using (Nguyen et al., 2011) as my basis.

Using a bag-of-words model, I count the occurrences of words per document (the documents are posts from boards) and create a vector for every post.

I limit the size of each vector by using the top-k (k = some number) most frequently used words as features (stop words are excluded):

Vectorexample_with_k_8 = [0,0,0,1,0,3,0,0]

My data is generally sparse, as in the example above.
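For reference, a minimal sketch (not from the original post) of how such top-k count vectors could be built with scikit-learn's CountVectorizer; the sample posts and k=8 are made-up placeholders, and a recent scikit-learn version is assumed:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical board posts used only to illustrate the vectorization step
posts = [
    "first example post from a board",
    "another post another board post",
]
# max_features keeps only the k most frequent terms; stop_words drops English stopwords
vectorizer = CountVectorizer(max_features=8, stop_words='english')
X = vectorizer.fit_transform(posts)   # sparse count matrix, one row per post
print(vectorizer.get_feature_names_out())
print(X.toarray())                    # dense view: rows like [0, 0, 0, 1, 0, 3, 0, 0]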

When I test the model on my test data I get a very low r² score (0.00-0.1), sometimes even a negative score. The model always predicts the same age, which happens to be the average age of my dataset, as can be seen in the distribution of my data (age/count).

I used different regression models from scikit-learn: Linear Regression, Lasso, and SGDRegressor, with no improvement.

So the questions are:

1. How can I improve the r² score?

2. Do I need to change my data to fit the regression better? If so, with which method?

3. Which regressor/method should I use for text classification?

Answer

To my knowledge, bag-of-words models usually use Naive Bayes as the classifier to fit the document-by-term sparse matrix.

None of your regressors handles a large sparse matrix well. Lasso may work well if you have groups of highly correlated features.

I think Latent Semantic Analysis may provide better results for your problem. Essentially, use TfidfVectorizer to normalize the word count matrix, then use TruncatedSVD to reduce the dimensionality, retaining the first N components that capture the major variance. Most regressors should work well with the lower-dimensional matrix. In my experience, SVM works pretty well for this problem.

A sample script is shown here:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),
    ('clf', svm.SVR())
])
# You can tune hyperparameters using grid search
params = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'svd__n_components': (50, 100, 150, 200),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, params, scoring='r2',
                           n_jobs=-1, verbose=10)
# Fit your documents (a list/array of strings) against y (the authors' ages)
grid_search.fit(documents, y)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
