Using Principal Components Analysis (PCA) on binary data
Problem description
I am using PCA on binary attributes to reduce the dimensions (attributes) of my problem. The initial dimensionality was 592 and after PCA it is 497. I used PCA before, on numeric attributes in another problem, and it managed to reduce the dimensionality to a greater extent (half of the initial dimensions). I believe that binary attributes decrease the power of PCA, but I do not know why. Could you please explain why PCA does not work as well on binary data as it does on numeric data?
Thanks.
Recommended answer
The principal components of 0/1 data can fall off slowly or rapidly, and the PCs of continuous data too -- it depends on the data. Can you describe your data?
The following picture is intended to compare the PCs of continuous image data vs. the PCs of the same data quantized to 0/1: in this case, inconclusive.
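As a quick sanity check of "it depends on the data", here is a toy sketch (synthetic data, not the asker's 592-attribute set) that compares how fast the explained variance falls off for a continuous matrix versus the same matrix quantized to 0/1. The shapes and the 90% threshold are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous data with low-rank structure (toy: rank ~10 in 50 columns).
A = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 50))
B = (A > 0).astype(float)   # the same data quantized to 0/1

def explained_variance(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                  # center columns, as PCA requires
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()

ev_cont = explained_variance(A)
ev_bin = explained_variance(B)

# How many components are needed to keep 90% of the variance?
k_cont = int(np.searchsorted(np.cumsum(ev_cont), 0.90)) + 1
k_bin = int(np.searchsorted(np.cumsum(ev_bin), 0.90)) + 1
print("components for 90% variance -- continuous:", k_cont, " binary:", k_bin)
```

On data like this, quantizing to 0/1 typically spreads the variance over more components, so more dimensions are needed to reach the same threshold; but as the answer says, the falloff really depends on the data.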
Look at PCA as a way of getting an approximation to a big matrix,
first with one term: approximate A ≈ c U Vᵀ, whose entries are c Ui Vj.
Consider this a bit, with A say 10k x 500: U 10k long, V 500 long.
The top row is c U1 V, the second row is c U2 V ...
all the rows are proportional to V.
Similarly the leftmost column is c U V1 ...
all the columns are proportional to U.
But if all rows are similar (proportional to each other),
they can't get near an A matrix with rows or columns like 0 1 0 0 0 1 0 1 0 1 ...
With more terms, A ≈ c1 U1 V1ᵀ + c2 U2 V2ᵀ + ...,
we can get nearer to A: the bigger the ci, the faster.
(Of course, all 500 terms recreate A exactly, to within roundoff error.)
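The k-term approximation described above can be sketched with numpy's SVD (a toy 100 x 40 matrix stands in for the 10k x 500 example; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((100, 40))   # toy matrix standing in for the 10k x 500 example

# Thin SVD: A = U @ diag(s) @ Vt, singular values s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def approx(k):
    """Sum of the first k terms c_i U_i V_i^T (a rank-k approximation)."""
    return (U[:, :k] * s[:k]) @ Vt[:k]

# The error shrinks as terms are added; with all terms,
# A is recreated exactly, to within roundoff error.
err1 = np.linalg.norm(A - approx(1))
err10 = np.linalg.norm(A - approx(10))
err_full = np.linalg.norm(A - approx(40))
print(err1, err10, err_full)
```

Each term `s[i] * np.outer(U[:, i], Vt[i])` is one of the c U Vᵀ dyads from the explanation; summing the first k of them is exactly the k-term picture shown for lena below.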
The top row is "lena", a well-known 512 x 512 matrix, with 1-term and 10-term SVD approximations. The bottom row is lena discretized to 0/1, again with 1 term and 10 terms. I thought that the 0/1 lena would be much worse -- comments, anyone?
(U Vᵀ is also written U ⊗ V, called a "dyad" or "outer product".)
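A dyad is the simplest possible matrix, rank 1; a tiny example makes the "every row proportional to V, every column proportional to U" property concrete:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])

# The dyad u ⊗ v: D[i, j] = u[i] * v[j], a rank-1 matrix.
D = np.outer(u, v)
print(D)

# Every row of D is a multiple of v, every column a multiple of u.
```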
(The wikipedia articles Singular value decomposition and Low-rank approximation are a bit math-heavy. An AMS column by David Austin, We Recommend a Singular Value Decomposition gives some intuition on SVD / PCA -- highly recommended.)