Weka's PCA is taking too long to run


Question

I am trying to use Weka for feature selection with the PCA algorithm.

My original feature space contains ~9000 attributes over 2700 samples.
I tried to reduce the dimensionality of the data using the following code:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

// PCA as the attribute evaluator, ranked by a Ranker search
AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
selector.setEvaluator(pca);
selector.setSearch(ranker);
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
    selector.SelectAttributes(instances);   // this call never returned within 12 hours
    return SamplesManager.asSamplesList(selector.reduceDimensionality(instances));
} catch (Exception e) {
    ...
}

However, this did not finish running within 12 hours. It is stuck at the method selector.SelectAttributes(instances);

My questions are: Is such a long computation time expected for Weka's PCA, or am I using PCA incorrectly?

If the long run time is expected:
How can I tune the PCA algorithm to run much faster? Can you suggest an alternative (with example code showing how to use it)?

If it is not:
What am I doing wrong? How should I invoke PCA in Weka to get the reduced-dimensionality data?

Update: The comments confirm my suspicion that it is taking much more time than expected.
I'd like to know: how can I compute PCA in Java, using Weka or an alternative library?
Added a bounty for this one.

Answer

After digging into the Weka code, the bottleneck is building the covariance matrix and then computing the eigenvectors of that matrix. Even switching to a sparse matrix implementation (I used COLT's SparseDoubleMatrix2D, http://acs.lbl.gov/software/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html) did not help.
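
As a side note on the "alternative library" part of the question: a minimal sketch of plain covariance-based PCA with Apache Commons Math 3 could look like the following (the commons-math3 dependency and the data / k names are assumptions for illustration, not part of the solution above). Note that eigendecomposing a 9000x9000 covariance matrix is still expensive, which is why reducing the dimensionality first matters:

import java.util.Arrays;
import java.util.Comparator;

import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;

// given: double[][] data, one row per sample, one column per feature (ideally mean-centered)
int k = 100;   // number of principal components to keep (illustrative)

// 1. covariance matrix (Covariance treats each column as one variable)
RealMatrix covMatrix = new Covariance(data).getCovarianceMatrix();

// 2. eigendecomposition of the covariance matrix
EigenDecomposition eig = new EigenDecomposition(covMatrix);

// 3. order components by decreasing eigenvalue, in case they are not already sorted
final double[] eigenvalues = eig.getRealEigenvalues();
Integer[] order = new Integer[eigenvalues.length];
for (int i = 0; i < order.length; i++) order[i] = i;
Arrays.sort(order, Comparator.comparingDouble((Integer i) -> eigenvalues[i]).reversed());

// 4. stack the top-k eigenvectors as columns of the projection matrix
RealMatrix projection = MatrixUtils.createRealMatrix(covMatrix.getRowDimension(), k);
for (int i = 0; i < k; i++) {
    projection.setColumn(i, eig.getEigenvector(order[i]).toArray());
}

// 5. project the samples onto the top-k principal components
RealMatrix reduced = MatrixUtils.createRealMatrix(data).multiply(projection);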

The solution I came up with was to first reduce the dimensionality using a fast method (I used an information gain ranker and filtering based on document frequency), and then run PCA on the reduced feature space to reduce it further.

The code is more complex, but it essentially comes down to this:

import java.util.Arrays;

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

// Stage 1: cheap ranking by information gain; keep only the top candidates
Ranker ranker = new Ranker();
InfoGainAttributeEval ig = new InfoGainAttributeEval();
Instances instances = SamplesManager.asWekaInstances(trainSet);
ig.buildEvaluator(instances);
int[] firstAttributes = ranker.search(ig, instances);
int[] candidates = Arrays.copyOfRange(firstAttributes, 0, FIRST_SIZE_REDUCTION);
instances = reduceDimensions(instances, candidates); // own helper: keep only the candidate attributes

// Stage 2: PCA on the already-reduced feature space
PrincipalComponents pca = new PrincipalComponents();
pca.setVarianceCovered(var);
ranker = new Ranker();
ranker.setNumToSelect(numFeatures);
AttributeSelection selection = new AttributeSelection();
selection.setEvaluator(pca);
selection.setSearch(ranker);
selection.SelectAttributes(instances);
instances = selection.reduceDimensionality(instances);

However, this method scored worse than using greedy information gain with a ranker alone when I cross-validated for estimated accuracy.
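
For reference, a rough sketch of how such a cross-validated comparison can be set up in Weka (assumptions for illustration: the Instances have their class attribute set, NaiveBayes stands in for whatever base classifier is actually used, and numFeatures is the target attribute count as above). AttributeSelectedClassifier re-runs the attribute selection on each training fold, so the accuracy estimate is not biased by selecting attributes on the full data set:

import java.util.Random;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;

// Wrap selection and classifier so selection is redone inside each CV fold
Ranker igRanker = new Ranker();
igRanker.setNumToSelect(numFeatures);           // illustrative target size

AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
asc.setEvaluator(new InfoGainAttributeEval());
asc.setSearch(igRanker);
asc.setClassifier(new NaiveBayes());            // stand-in base classifier

Instances instances = SamplesManager.asWekaInstances(trainSet); // class attribute must be set
Evaluation eval = new Evaluation(instances);
eval.crossValidateModel(asc, instances, 10, new Random(1));
System.out.println("10-fold CV accuracy: " + eval.pctCorrect() + "%");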
