Categorical and ordinal feature data difference in regression analysis?


Problem description

I am trying to fully understand the difference between categorical and ordinal data when doing regression analysis. So far, this much is clear:

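Categorical feature and data example:
Color: red, white, black
Why categorical: red < white < black is logically incorrect

Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: old < renovated < new is logically correct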

Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-hot encoding for categorical data
Arbitrary numbers that preserve the order, for ordinal data

Categorical example:

data = {'color': ['blue', 'green', 'green', 'red']}

Numeric format after one-hot encoding:

   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
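
As a minimal sketch of how the table above can be produced, the following uses only the Python standard library. The variable names `categories` and `one_hot` are illustrative assumptions; in practice a library helper such as `pandas.get_dummies` is commonly used for the same job.

```python
# One-hot encode the 'color' column from the example data.
data = {'color': ['blue', 'green', 'green', 'red']}

# The sorted unique categories become the new column names.
categories = sorted(set(data['color']))

# Each row gets a 1 in the column matching its color and a 0 elsewhere.
one_hot = {
    f'color_{cat}': [1 if value == cat else 0 for value in data['color']]
    for cat in categories
}

for name, column in one_hot.items():
    print(name, column)
```

Each printed column corresponds to one column of the table, read top to bottom.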

Ordinal example:

data = {'con': ['old', 'new', 'new', 'renovated']}

Numeric format after applying the mapping old < renovated < new → 0, 1, 2:

0    0
1    2
2    2
3    1
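
The mapping step can be sketched as an explicit dictionary lookup. The name `condition_order` is an assumption made for illustration; the codes 0, 1, 2 are the ones given in the question.

```python
# Ordinal-encode the 'con' column with an explicit, order-preserving mapping.
data = {'con': ['old', 'new', 'new', 'renovated']}

# old < renovated < new, so the codes increase in the same order.
condition_order = {'old': 0, 'renovated': 1, 'new': 2}

encoded = [condition_order[value] for value in data['con']]
print(encoded)  # [0, 2, 2, 1]
```

Writing the mapping out by hand keeps the intended order under your control, rather than relying on alphabetical or first-seen order.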

In my data, price increases as condition changes from "old" to "new". "Old" was encoded as 0 and "new" as 2, so as the condition value increases, the price increases too. That makes sense.
Now let's look at the "color" feature. In my case, different colors also affect price; for example, "black" will be more expensive than "white". But in the numeric representation of categorical data shown above, I do not see an increasing dependency like the one for the "condition" feature. Does that mean a change in color does not affect price in a regression model when one-hot encoding is used? Why use one-hot encoding for regression if it does not affect price anyway? Can you clarify?


UPDATE TO QUESTION:
First, I introduce the formula for linear regression: the predicted price is a constant term plus the sum of each feature value multiplied by its own theta.
Now let's look at the data representations for color and predict the price of the 1st and 2nd items using the formula with both representations.
One-hot encoding: in this case a different theta exists for each color, and the predictions are:

Price (1 item) = 0 + 20*1 + 50*0 + 100*0 = 20$  (thetas are assumed for example)
Price (2 item) = 0 + 20*0 + 50*1 + 100*0 = 50$  (thetas are assumed for example)

Ordinal encoding for color: in this case all colors share one theta, but the multipliers differ:

Price (1 item) = 0 + 20*10 = 200$  (theta assumed for example)
Price (2 item) = 0 + 20*20 = 400$  (theta assumed for example)

In my model, prices go White < Red < Black. The predictions seem logical in both cases, for the ordinal and the categorical representation. So can I use either encoding for my regression regardless of the data type (categorical or ordinal)? Is this division just a matter of convention and software-oriented representation rather than a matter of the regression logic itself?
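
The arithmetic in the update can be reproduced with a small helper. This is only a sketch: the function name `predict` is hypothetical, and the thetas and color codes are the assumed example values from the question, not fitted results.

```python
def predict(theta0, thetas, features):
    """Linear model: theta0 plus the sum of theta_i * x_i."""
    return theta0 + sum(t * x for t, x in zip(thetas, features))

# One-hot representation: one theta per color column (assumed values).
one_hot_thetas = [20, 50, 100]
price_item1_one_hot = predict(0, one_hot_thetas, [1, 0, 0])
price_item2_one_hot = predict(0, one_hot_thetas, [0, 1, 0])
print(price_item1_one_hot, price_item2_one_hot)  # 20 50

# Single numeric column: one shared theta, the color code is the multiplier.
ordinal_thetas = [20]
price_item1_ordinal = predict(0, ordinal_thetas, [10])
price_item2_ordinal = predict(0, ordinal_thetas, [20])
print(price_item1_ordinal, price_item2_ordinal)  # 200 400
```

With a single numeric column, the price gap between two colors is always the shared theta times the difference of their codes, whereas the one-hot form lets each color carry its own independent value.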

Recommended answer

You will not see an increasing dependency. The whole point of this distinction is that colour is not a feature you can meaningfully place on a continuum, as you've already noted.

The one-hot encoding makes it very convenient for the software to analyze this dimension. Instead of a single feature "colour" with the listed values, you have a set of boolean (present / not-present) features. For instance, your row 0 above has features color_blue = true, color_green = false, and color_red = false.

The results you get should show each of these as a separate dimension with its own coefficient. For instance, the presence of color_blue may be worth $200, while color_green is worth -$100.
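
To illustrate what "a separate dimension per color" means numerically, here is a small sketch on invented data. It relies on a known property of least squares: when the model contains only one-hot columns and no intercept, the fitted coefficient of each column is the mean price of the rows in that category. The rows and prices are hypothetical.

```python
from collections import defaultdict

# Hypothetical (color, price) observations, invented for illustration.
rows = [('blue', 210), ('blue', 190), ('green', 95), ('green', 105), ('red', 150)]

# Group the prices by color.
prices_by_color = defaultdict(list)
for color, price in rows:
    prices_by_color[color].append(price)

# One coefficient per one-hot column: the mean price within that color.
coefficients = {
    f'color_{color}': sum(prices) / len(prices)
    for color, prices in prices_by_color.items()
}
print(coefficients)
```

Each color ends up with its own value, and nothing forces those values to rise or fall in any particular order.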

Summary: don't look for a linear regression line running across a (non-existent) color axis; rather, look for color_* factors, one for each color. As far as your analysis algorithm is concerned, these are utterly independent features; the "one-hot" encoding (a term from digital circuit design) is merely our convention for dealing with this.

After your edit of the question (02:03 Z, 04 Dec 2015):

No, your assumption is not correct: the two representations are not merely a matter of convenience. The ordering of colors works for this example only because the effect happens to be a neat, linear function of the chosen encoding. As your example shows, your simpler encoding assumes that White-to-Red-to-Black pricing is a linear progression. What do you do when Green, Blue, and Brown are all $25, the rare Yellow is worth $500, and Transparent reduces the price by $1,000?
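
A rough sketch of that failure case: the values are the ones quoted in the paragraph, while the integer codes 0 to 4 are an arbitrary ordering assumed purely for illustration. A single straight line through the coded colors is fitted with the closed-form least-squares formulas.

```python
# Price effects from the example; the 0..4 codes are an assumed arbitrary order.
effects = {'green': 25, 'blue': 25, 'brown': 25, 'yellow': 500, 'transparent': -1000}
codes = {color: i for i, color in enumerate(effects)}

xs = [codes[color] for color in effects]
ys = [effects[color] for color in effects]

# Closed-form simple linear regression: effect ~ intercept + slope * code.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Compare what the line predicts with the actual value for each color.
residuals = {}
for color in effects:
    fitted = intercept + slope * codes[color]
    residuals[color] = effects[color] - fitted
    print(f'{color:12s} actual={effects[color]:6d} fitted={fitted:8.1f}')
```

The fitted line misses most colors by a wide margin, because one slope cannot express five unrelated values; with one-hot columns each color would get its own coefficient instead.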

Also, how do you know in advance that Black is worth more than White, which in turn is worth more than Red?

Consider the case of housing prices based on elementary school district, with 50 districts in the area. If you use a numerical coding (school district number, ordinal position alphabetically, or some other arbitrary ordering), the regression software will have great trouble finding a correlation between that number and the housing price. Is PS 107 a more expensive district than PS 32 or PS 15? Are Addington and Bendemeer preferred to Union City and Ventura?

Splitting these into 50 different features under the one-hot principle decouples the feature from the encoding, and allows the analysis software to treat them in a mathematically meaningful manner. It's not perfect by any means: expanding from, say, 20 features to 70 means that it will take longer to converge. But we do get meaningful results for the school district.

If you wish, now that those results tell you what each district is worth, you could re-encode that feature in the expected order of value, and get a reasonable fit with little loss of accuracy and faster prediction from your model (fewer variables).
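
One possible reading of that last suggestion, sketched with invented numbers: take the per-district effects from a one-hot fit, rank the districts by effect, and use the rank as a single numeric code. The effect values and the names `district_effect` and `value_order_code` are hypothetical.

```python
# Hypothetical per-district price effects, as might be read off a one-hot fit.
district_effect = {'PS 107': 40_000, 'PS 32': -15_000, 'PS 15': 5_000}

# Rank districts from lowest to highest effect and use the rank as the code.
ranked = sorted(district_effect, key=district_effect.get)
value_order_code = {district: rank for rank, district in enumerate(ranked)}
print(value_order_code)
```

The resulting single column is ordered by value rather than by district number or name, which is what makes a straight-line fit plausible; the equal spacing between ranks remains an approximation.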
