Dummy Coding of Nominal Attributes - Effect of Using K Dummies, Effect of Attribute Selection


Question


Summing up my understanding of the topic: 'Dummy Coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using K values would cause redundancy and would have a negative impact, e.g. on logistic regression, as far as I have learned. So far, everything is clear to me.
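To make the K vs. K-1 distinction concrete, here is a minimal sketch in Python using pandas (the column name checking_status and the values A11..A14 are just borrowed from the German Credit example mentioned below; the data itself is made up):

import pandas as pd

df = pd.DataFrame({"checking_status": ["A11", "A12", "A13", "A14", "A11"]})

# K dummies: one binary column per possible value
k_dummies = pd.get_dummies(df["checking_status"], prefix="checking")
print(k_dummies)

# K-1 dummies: drop one column; its value becomes the implicit baseline (all zeros)
k_minus_one = pd.get_dummies(df["checking_status"], prefix="checking", drop_first=True)
print(k_minus_one)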

然而,有两个问题是我不清楚:

Yet, two issues are unclear to me:


1) Bearing in mind the issue stated above, I am confused that the 'Logistic' classifier in WEKA actually uses K dummies (see picture). Why would that be the case?


2) An issue arises as soon as I consider attribute selection. While the left-out attribute value is implicitly included as the case where all dummies are zero when all dummies are actually used in the model, it is no longer clearly represented if one dummy is missing (because it was not selected during attribute selection). The issue is much easier to understand with the sketch I uploaded. How can that issue be treated?
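A rough illustration of that ambiguity, again as a pandas sketch with made-up data: with K-1 dummies the all-zeros pattern stands for exactly one value (the baseline), but once attribute selection removes another dummy, the all-zeros pattern covers two different original values.

import pandas as pd

df = pd.DataFrame({"attr": ["A11", "A12", "A13", "A14"]})

# K-1 coding: A11 is the left-out baseline, encoded as all zeros
dummies = pd.get_dummies(df["attr"], drop_first=True)
print(dummies)      # columns A12, A13, A14; the A11 row is (0, 0, 0)

# Pretend attribute selection removed the A13 dummy as well
selected = dummies.drop(columns=["A13"])
print(selected)     # now both A11 and A13 show up as (0, 0) - no longer distinguishable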



Images


WEKA Output: The Logistic algorithm was run on the UCI dataset German Credit, where the possible values of the first attribute are A11,A12,A13,A14. All of them are included in the logistic regression model. http://abload.de/img/bildschirmfoto2013-089out9.png


Decision Tree Example: Sketch showing the issue when it comes to running decision trees on datasets with dummy-coded instances after attribute selection. http://abload.de/img/sketchziu5s.jpg

Answer


The output is generally easier to read, interpret and use when you use k dummies instead of k-1 dummies. I figure that is why everybody seems to actually use k dummies. But yes, as the k values sum up to 1, there exists a correlation that may cause problems. But correlations in data sets are common, and you will never completely get rid of them!
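A quick check of that point with hypothetical values: with all k dummies, the columns sum to 1 in every row, so a design matrix that also contains an intercept is rank-deficient (perfect multicollinearity).

import numpy as np
import pandas as pd

values = pd.Series(["A11", "A12", "A13", "A14", "A12", "A11"])

# all K dummies
X = pd.get_dummies(values).to_numpy(dtype=float)
print(X.sum(axis=1))    # every row sums to 1.0

# adding an intercept column makes the matrix rank-deficient:
# the intercept equals the sum of the K dummy columns
X_with_intercept = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_with_intercept.shape[1])                 # 5 columns
print(np.linalg.matrix_rank(X_with_intercept))   # rank 4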


I believe feature selection and dummy coding just don't fit together. It amounts to dropping some values from the attribute. Why do you insist on doing feature selection?


You really should be using weighting, or consider more advanced algorithms that can handle such data. In fact the dummy variables can cause just as much trouble, because they are binary, and oh so many algorithms (e.g. k-means) don't make much sense on binary variables.


As for the decision tree: don't perform feature selection on your output attribute... Plus, as a decision tree already selects features, it does not make much sense to do all this anyway... leave it to the decision tree to decide which attribute to use for splitting. This way, it can learn dependencies, too.
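As a sketch of that suggestion (synthetic data, not the German Credit set): keep all K dummies and let the tree decide which ones to split on, instead of running attribute selection first.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "checking_status": ["A11", "A12", "A13", "A14", "A11", "A13", "A14", "A12"],
    "label":           ["bad", "good", "good", "good", "bad", "good", "good", "bad"],
})

X = pd.get_dummies(df["checking_status"])   # all K dummies, no prior selection
y = df["label"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# the tree itself decides which dummies are worth splitting on
print(dict(zip(X.columns, tree.feature_importances_)))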
