Using Principal Components Analysis (PCA) on binary data

Problem Description

I am using PCA on binary attributes to reduce the dimensions (attributes) of my problem. The initial dimensions were 592, and after PCA the dimensions are 497. I used PCA before, on numeric attributes in another problem, and it managed to reduce the dimensions to a greater extent (half of the initial dimensions). I believe that binary attributes decrease the power of PCA, but I do not know why. Could you please explain why PCA does not work as well on binary data as it does on numeric data?

Thanks.

Recommended Answer

The principal components of 0/1 data can fall off slowly or rapidly, and the PCs of continuous data can too; it depends on the data. Can you describe your data?
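As a rough way to see this fall-off for your own data, here is a minimal sketch of mine (not part of the original answer; the low-rank generator and the 90% threshold are arbitrary illustrative choices, and NumPy / scikit-learn are assumed):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Hypothetical data: 1000 samples, 50 attributes driven by ~5 latent factors.
    Z = rng.normal(size=(1000, 5))
    W = rng.normal(size=(5, 50))
    X_cont = Z @ W + 0.1 * rng.normal(size=(1000, 50))   # continuous attributes
    X_bin = (X_cont > 0).astype(float)                   # the same data quantized to 0/1

    for name, X in (("continuous", X_cont), ("0/1", X_bin)):
        cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
        k90 = int(np.searchsorted(cum, 0.90)) + 1        # components for 90% of variance
        print(f"{name}: {k90} components reach 90% of the variance")

How many components you keep for a given variance threshold is exactly this fall-off; comparing the two printouts shows how much quantizing to 0/1 costs on this particular data.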

The following picture is intended to compare the PCs of continuous image data vs. the PCs of the same data quantized to 0/1: in this case, inconclusive.

Look at PCA as a way of getting an approximation to a big matrix,
first with one term: approximate A ~ c U V^T, a rank-1 matrix with entries c U_i V_j.
Consider this a bit, with A say 10k x 500: U 10k long, V 500 long. The top row is c U_1 V, the second row is c U_2 V ... all the rows are proportional to V. Similarly, the leftmost column is c U V_1 ... all the columns are proportional to U.
But if all rows are similar (proportional to each other), they can't get near an A matrix with rows or columns like 0100010101 ...
With more terms, A ~ c_1 U_1 V_1^T + c_2 U_2 V_2^T + ..., we can get nearer to A: the bigger the c_i, the faster. (Of course, all 500 terms recreate A exactly, to within roundoff error.)
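Here is a small NumPy sketch of that idea (my illustration, not from the original answer; the sparse 0/1 matrix is an arbitrary stand-in for rows like 0100010101):

    import numpy as np

    rng = np.random.default_rng(0)
    A = (rng.random((100, 20)) < 0.1).astype(float)  # sparse 0/1 rows, 0100010101-ish

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    def k_term(k):
        # c_1 U_1 V_1^T + ... + c_k U_k V_k^T; every row of k_term(1)
        # is proportional to Vt[0], which is why one term fits 0/1 rows badly.
        return (U[:, :k] * s[:k]) @ Vt[:k]

    for k in (1, 5, 20):
        err = np.linalg.norm(A - k_term(k)) / np.linalg.norm(A)
        print(f"{k}-term approximation: relative error {err:.3f}")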

The top row is "lena", a well-known 512 x 512 matrix, with 1-term and 10-term SVD approximations. The bottom row is lena discretized to 0/1, again with 1 term and 10 terms. I thought that the 0/1 lena would be much worse -- comments, anyone?
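A rough way to rerun that experiment numerically (a sketch under assumptions: SciPy no longer ships "lena", so scipy.datasets.ascent, another 512 x 512 grayscale image, stands in here, and thresholding at the median is my choice):

    import numpy as np
    from scipy.datasets import ascent   # needs scipy >= 1.10 and the pooch package

    img = ascent().astype(float)                   # 512 x 512, continuous grey levels
    img01 = (img > np.median(img)).astype(float)   # the same image quantized to 0/1

    for name, A in (("continuous", img), ("0/1", img01)):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        for k in (1, 10):
            Ak = (U[:, :k] * s[:k]) @ Vt[:k]       # k-term SVD approximation
            err = np.linalg.norm(A - Ak) / np.linalg.norm(A)
            print(f"{name}, {k}-term SVD: relative error {err:.3f}")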

(U V^T is also written U ⊗ V, called a "dyad" or "outer product".)
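For instance, in NumPy (a trivial check, my addition):

    import numpy as np

    U = np.array([1.0, 2.0, 3.0])
    V = np.array([4.0, 5.0])
    D = np.outer(U, V)               # the dyad U ⊗ V, a 3 x 2 matrix
    print(np.linalg.matrix_rank(D))  # 1: every dyad has rank 1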

(The Wikipedia articles Singular value decomposition and Low-rank approximation are a bit math-heavy. An AMS column by David Austin, We Recommend a Singular Value Decomposition, gives some intuition on SVD / PCA -- highly recommended.)
