Algorithm to complete a corrupted matrix of data


Problem description

I have the following problem:

I extracted a set of data, but some of the values are either unavailable or missing. For each item I identified 10 parameters:

         param1   param2   ...   param10
Item 1   1220     N/A      ...   1000
Item 2   1300     200      ...   1000
...      ...      ...      ...   ...
Item N   N/A      1000     ...   200

N is about 1500, and roughly half of the items (about 750) have all their values.

There is an implicit logic in how the items were created, so I would like to fill in the missing values with the best possible estimates.

Example

Let's imagine you have 2 parameters and 3 items.

       param1  param2
item1    400    200
item2    200    100
item3    100     N/A

With linear interpolation you would easily get param2 = 50 for item3.

My idea:

As I have 10 parameters and 1500 items, I thought of doing a PCA on the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix) of the 750 items that are complete, to find the main directions of the data set.

The PCA will give me one main direction for my items (largest eigenvalue) and sub-directions for subgroups of items (smaller eigenvalues).

I wanted to project the vectors with missing parameters onto the main direction, for example, to get approximate values for the missing parameters.

From my first example:

       param1  param2
item1    400    200
item2    200    100
item3    100     X ?

Complete matrix:

       param1  param2
item1    400    200
item2    200    100

Covariance matrix:

   1    0.5
   0.5  1 

Eigenvectors and eigenvalues:

V1 and L1:

1
1   associated with 1.5

V2 and L2:

 1
-1  associated with 0.5

Result:

If I project on V1 only, I get X1 = 100.

If I project on l1.V1 + l2.V2, I get X1 = 50. This is because the first 2 items are perfectly correlated.
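As a rough check of the projection idea, here is a minimal NumPy sketch. The function name `pca_impute_row` and its arguments are made up for illustration and are not from the question. It assumes the principal directions are estimated from the complete items only, and that a missing entry is reconstructed from the component scores fitted on the observed entries. Note that `np.cov` on the two complete items gives [[20000, 10000], [10000, 5000]] rather than the matrix written above, and its leading direction is proportional to (2, 1), which leads to a value of about 50 for item3.

```python
import numpy as np

def pca_impute_row(complete, row, n_components=1):
    """Fill NaNs in `row` using the top principal directions of `complete`.

    Hypothetical helper for illustration; assumes `row` has at least
    `n_components` observed entries.
    """
    complete = np.asarray(complete, dtype=float)
    row = np.asarray(row, dtype=float)

    mean = complete.mean(axis=0)
    cov = np.cov(complete, rowvar=False)

    # eigh returns eigenvalues in ascending order, so reverse the columns
    # to get the directions with the largest eigenvalues first.
    _, eigvecs = np.linalg.eigh(cov)
    components = eigvecs[:, ::-1][:, :n_components]

    observed = ~np.isnan(row)
    # Least-squares fit of the component scores on the observed coordinates.
    scores, *_ = np.linalg.lstsq(
        components[observed], row[observed] - mean[observed], rcond=None
    )

    filled = row.copy()
    filled[~observed] = mean[~observed] + components[~observed] @ scores
    return filled

complete = np.array([[400.0, 200.0], [200.0, 100.0]])
item3 = np.array([100.0, np.nan])

sample_cov = np.cov(complete, rowvar=False)
filled_item3 = pca_impute_row(complete, item3)
print(sample_cov)
print(filled_item3)
```

With 10 parameters, `n_components` would be a tuning choice; keeping only the first direction is just the simplest case.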

So my question:

So far it's only theory. I haven't applied it yet, but before I start I would like to know whether I'm going somewhere with this.

Can I do better? (I really believe so.) What can I do if every item has one missing parameter? Where do I get the direction from?

Are there known good algorithms for filling in corrupted matrices, or can you help me complete my idea (by recommending good reading or methods)?

I think Netflix uses this kind of algorithm to fill in its film rating matrix automatically, for example (the Netflix $1M prize problem).
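One family of methods often associated with this kind of rating-matrix problem is low-rank matrix completion. The sketch below is only an illustration of that idea, not a description of what Netflix actually uses: it alternates between a truncated SVD and re-filling the missing entries, so it does not need any fully complete items. The function name `iterative_svd_impute`, the rank, and the stopping rule are assumptions chosen for this toy example, and the data is not centered.

```python
import numpy as np

def iterative_svd_impute(data, rank=1, n_iter=200, tol=1e-9):
    """Fill NaNs by repeatedly projecting onto a rank-`rank` approximation.

    Hypothetical helper for illustration; assumes every column has at
    least one observed value.
    """
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)

    # Start from the column means of the observed entries.
    col_means = np.nanmean(data, axis=0)
    filled = np.where(missing, col_means, data)

    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]

        # Keep observed values, update only the missing ones.
        new_filled = np.where(missing, low_rank, data)
        change = np.max(np.abs(new_filled - filled))
        filled = new_filled
        if change < tol:
            break
    return filled

data = np.array([[400.0, 200.0],
                 [200.0, 100.0],
                 [100.0, np.nan]])
completed = iterative_svd_impute(data, rank=1)
print(completed)
```

On the 3-item example the missing entry settles near 50. On real data the rank would have to be chosen, for instance by hiding some known values and checking how well they are recovered.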

If you believe this belongs on another Stack Exchange site, feel free to migrate it.

Recommended answer

Why not use numeric prediction from machine learning? In your first example, params are attributes and items are instances. With that framing you can try linear regression, neural networks, or anything else in a couple of minutes. After training you will get the following equation for your first example (param2 is marked as the class here):

param2 = 0 + 1/2 * param1

which is exactly what you want.
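The same equation can be reproduced outside Weka. The following is a minimal sketch using NumPy least squares; the helper name `regression_impute` is hypothetical. It assumes the model is trained only on fully complete items and fills the target column only where all other parameters are present.

```python
import numpy as np

def regression_impute(data, target_col):
    """Fill NaNs in one column by linear regression on the other columns.

    Hypothetical helper for illustration. Returns the filled matrix and
    the fitted coefficients (intercept first).
    """
    data = np.asarray(data, dtype=float)
    predictors = np.delete(data, target_col, axis=1)
    design = np.column_stack([np.ones(len(data)), predictors])

    # Train on complete rows; predict where only the target is missing.
    train = ~np.isnan(data).any(axis=1)
    to_fill = np.isnan(data[:, target_col]) & ~np.isnan(predictors).any(axis=1)

    coef, *_ = np.linalg.lstsq(design[train], data[train, target_col], rcond=None)

    filled = data.copy()
    filled[to_fill, target_col] = design[to_fill] @ coef
    return filled, coef

data = np.array([[400.0, 200.0],
                 [200.0, 100.0],
                 [100.0, np.nan]])
filled, coef = regression_impute(data, target_col=1)
print(coef)          # intercept and slope, close to 0 and 0.5
print(filled[2, 1])  # close to 50
```

For the full 10-parameter table, the same function could be called once per column that has gaps, though items missing several parameters at once would need a different treatment.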

If you're not sure that the relations between params are linear, you can always try other types of regression (ANN, SVM, anything).

For a quick start, use Weka. Convert your data to CSV, load it into Weka, and start playing. For numeric predictions, look at the "Classify" tab.
