Using Principal Components Analysis (PCA) on binary data
Problem description
I am using PCA on binary attributes to reduce the dimensions (attributes) of my problem. The initial dimensionality was 592 and after PCA it is 497. I used PCA before, on numeric attributes in another problem, and it managed to reduce the dimensionality to a greater extent (half of the initial dimensions). I believe that binary attributes decrease the power of PCA, but I do not know why. Could you please explain why PCA does not work as well on binary data as it does on numeric data?
Thanks.
Recommended answer
The principal components of 0/1 data can fall off slowly or rapidly, and the PCs of continuous data too -- it depends on the data. Can you describe your data?
The following picture is intended to compare the PCs of continuous image data vs. the PCs of the same data quantized to 0/1: in this case, inconclusive.
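As a quick sanity check of "it depends on the data", here is a toy sketch (synthetic data, not the asker's 592-attribute set) that compares how fast the explained variance falls off for a continuous matrix versus the same matrix quantized to 0/1. The shapes and the 90% threshold are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous data with low-rank structure (toy: rank ~10 in 50 columns).
A = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 50))
B = (A > 0).astype(float)   # the same data quantized to 0/1

def explained_variance(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                  # center columns, as PCA requires
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()

ev_cont = explained_variance(A)
ev_bin = explained_variance(B)

# How many components are needed to keep 90% of the variance?
k_cont = int(np.searchsorted(np.cumsum(ev_cont), 0.90)) + 1
k_bin = int(np.searchsorted(np.cumsum(ev_bin), 0.90)) + 1
print("components for 90% variance -- continuous:", k_cont, " binary:", k_bin)
```

On data like this, quantizing to 0/1 typically spreads the variance over more components, so more dimensions are needed to reach the same threshold; but as the answer says, the falloff really depends on the data.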
Look at PCA as a way of getting an approximation to a big matrix,
first with one term: approximate A ≈ c U Vᵀ, whose entries are c Ui Vj.
Consider this a bit, with A say 10k x 500: U 10k long, V 500 long.
The top row is c U1 V, the second row is c U2 V ...
all the rows are proportional to V.
Similarly the leftmost column is c U V1 ...
all the columns are proportional to U.
But if all rows are similar (proportional to each other),
they can't get near an A matrix with rows or columns like 0 1 0 0 0 1 0 1 0 1 ...
With more terms, A ≈ c1 U1 V1ᵀ + c2 U2 V2ᵀ + ...,
we can get nearer to A: the bigger the ci, the faster.
(Of course, all 500 terms recreate A exactly, to within roundoff error.)
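The k-term approximation described above can be sketched with numpy's SVD (a toy 100 x 40 matrix stands in for the 10k x 500 example; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((100, 40))   # toy matrix standing in for the 10k x 500 example

# Thin SVD: A = U @ diag(s) @ Vt, singular values s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def approx(k):
    """Sum of the first k terms c_i U_i V_i^T (a rank-k approximation)."""
    return (U[:, :k] * s[:k]) @ Vt[:k]

# The error shrinks as terms are added; with all terms,
# A is recreated exactly, to within roundoff error.
err1 = np.linalg.norm(A - approx(1))
err10 = np.linalg.norm(A - approx(10))
err_full = np.linalg.norm(A - approx(40))
print(err1, err10, err_full)
```

Each term `s[i] * np.outer(U[:, i], Vt[i])` is one of the c U Vᵀ dyads from the explanation; summing the first k of them is exactly the k-term picture shown for lena below.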
The top row is "lena", a well-known 512 x 512 matrix, with 1-term and 10-term SVD approximations. The bottom row is lena discretized to 0/1, again with 1 term and 10 terms. I thought that the 0/1 lena would be much worse -- comments, anyone?
(U Vᵀ is also written U ⊗ V, called a "dyad" or "outer product".)
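A dyad is the simplest possible matrix, rank 1; a tiny example makes the "every row proportional to V, every column proportional to U" property concrete:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])

# The dyad u ⊗ v: D[i, j] = u[i] * v[j], a rank-1 matrix.
D = np.outer(u, v)
print(D)

# Every row of D is a multiple of v, every column a multiple of u.
```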
(The wikipedia articles Singular value decomposition and Low-rank approximation are a bit math-heavy. An AMS column by David Austin, We Recommend a Singular Value Decomposition gives some intuition on SVD / PCA -- highly recommended.)