How could I deal with the sparse feature with high dimension in an SVR task?

Problem description

I have a twitter-like (another micro-blog) data set with 1.6 million data points, and I am trying to predict each post's retweet count from its content. I extracted the keywords and used them as a bag-of-words feature, which gives about 1.2 million feature dimensions. The feature vectors are very sparse, usually with only about ten non-zero dimensions per data point. I use SVR to do the regression, and training has now been running for 2 days, so I expect the training time to be very long. I don't know whether this is normal for a task like this. Is there any way, or is it necessary, to optimize this problem?
BTW, in this case I don't use any kernel, and the machine has 32 GB of RAM and a 16-core i7. Roughly how long should training take? I used the PyML library.
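
For concreteness, here is a minimal sketch of this kind of sparse bag-of-words regression setup, written with scikit-learn rather than PyML; the texts and retweet counts below are made-up toy data, and a kernel-free LinearSVR is used because it handles sparse input and trains in roughly linear time.

```python
# A rough sketch of the setup described above, using scikit-learn instead of
# PyML; the texts and retweet counts are made-up toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

texts = ["big election news tonight", "cute cat picture", "breaking news again"]
retweets = [120, 45, 300]                     # hypothetical retweet counts

vectorizer = CountVectorizer()                # sparse bag-of-words features
X = vectorizer.fit_transform(texts)           # scipy.sparse matrix, n_docs x n_terms

model = LinearSVR()                           # linear (kernel-free) SVR
model.fit(X, retweets)
print(model.predict(vectorizer.transform(["more election news"])))
```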

Recommended answer

You need to find a dimensionality reduction approach that works for your problem.

I've worked on a similar problem to yours and I found that Information Gain worked well, but there are others.

I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).

These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.

These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem.
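
As a rough illustration (not code from the paper or from the original answer; the toy texts, labels and the cutoff k are made up), the χ2-style selection described above can be done with scikit-learn's chi2 scorer, keeping the highest-scoring terms.

```python
# A hedged sketch of chi-square term selection: score every term against the
# category labels and keep the terms with the highest chi2(tk, ci) values.
# Toy texts, labels and k are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

texts = ["goal scored in the final match",
         "stock prices fell again today",
         "the team wins the cup"]
labels = [1, 0, 1]                            # 1 = sports category, 0 = other

vectorizer = CountVectorizer(binary=True)     # presence/absence of each term
X = vectorizer.fit_transform(texts)

scores, _ = chi2(X, labels)                   # one chi-square score per term
k = 3
top = np.argsort(scores)[::-1][:k]            # indices of the k highest-scoring terms
print(vectorizer.get_feature_names_out()[top])
```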

I've been successful using Information Gain for feature reduction and found this paper (Christine Largeron, Christophe Moulin, and Mathias Géry, Entropy based feature selection for text categorization, SAC 2011, pp. 924-928) to be a very good practical guide.

Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:

Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D):
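
A small sketch of how those four counts could be computed for every term at once, given a dense 0/1 document-term matrix and 0/1 category labels (the helper name is my own, not from the paper):

```python
# Sketch: per-term contingency counts A, B, C, D as defined above, for a dense
# 0/1 document-term matrix X (n_docs x n_terms) and 0/1 labels y (1 = in ck).
import numpy as np

def contingency_counts(X, y):
    X = (np.asarray(X) > 0).astype(int)       # presence/absence of each term
    y = np.asarray(y).astype(bool)            # True = document belongs to ck
    A = X[y].sum(axis=0)                      # docs of ck containing tj
    B = X[~y].sum(axis=0)                     # docs of other categories containing tj
    C = (1 - X[y]).sum(axis=0)                # docs of ck not containing tj
    D = (1 - X[~y]).sum(axis=0)               # docs of other categories not containing tj
    return A, B, C, D                         # N = A + B + C + D
```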

Using this contingency table, Information Gain can be estimated directly from the counts A, B, C and D.
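
A hedged sketch of such an estimate, using the standard binary Information Gain IG(tj, ck) = H(Ck) - H(Ck | tj) with base-2 logarithms; the exact formulation in the paper may differ slightly.

```python
# Sketch of the standard binary Information Gain IG(tj, ck) = H(Ck) - H(Ck | tj)
# computed from the contingency counts A, B, C, D above. Base-2 logarithms are
# assumed; the exact formulation in the paper may differ slightly.
import numpy as np

def information_gain(A, B, C, D):
    N = A + B + C + D

    def entropy(*probs):                      # ignore zero-probability terms
        return -sum(p * np.log2(p) for p in probs if p > 0)

    h_c = entropy((A + C) / N, (B + D) / N)   # H(Ck): category present / absent
    h_c_given_t = 0.0                         # H(Ck | tj)
    if A + B > 0:                             # documents containing tj
        h_c_given_t += (A + B) / N * entropy(A / (A + B), B / (A + B))
    if C + D > 0:                             # documents not containing tj
        h_c_given_t += (C + D) / N * entropy(C / (C + D), D / (C + D))
    return h_c - h_c_given_t

print(information_gain(A=40, B=5, C=10, D=45))  # toy counts, prints roughly 0.4
```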

This approach is easy to implement and provides very good Information-Theoretic feature reduction.

You needn't use a single technique either; you can combine them. Term Frequency is simple, but it can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
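
For example (a sketch only, not the original answer's code; the thresholds are made up, and it reuses the contingency_counts and information_gain helpers sketched above), one simple combination is to drop very rare terms first and then rank the survivors by Information Gain:

```python
# Sketch of one way to combine a frequency cut with Information Gain ranking,
# reusing the contingency_counts and information_gain helpers sketched above.
# min_df and top_k are illustrative thresholds, not values from the answer.
import numpy as np

def select_terms(X, y, min_df=5, top_k=10000):
    """X: dense 0/1 document-term matrix, y: 0/1 labels; returns kept column indices."""
    df = (X > 0).sum(axis=0)                       # document frequency of each term
    frequent = np.where(df >= min_df)[0]           # first pass: drop very rare terms
    A, B, C, D = contingency_counts(X[:, frequent], y)
    ig = np.array([information_gain(a, b, c, d)    # second pass: rank by IG
                   for a, b, c, d in zip(A, B, C, D)])
    return frequent[np.argsort(ig)[::-1][:top_k]]  # keep the highest-IG terms
```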
