Categorical and ordinal feature data representation in regression analysis?

Question

I am trying to fully understand the difference between categorical and ordinal data when doing regression analysis. For now, this is what is clear to me:

Categorical feature and data example:
Color: red, white, black
Why categorical: red < white < black is logically incorrect

Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: old < renovated < new is logically correct

Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Arbitrary numbers for ordinal data

Categorical data converted to numeric:

data = {'color': ['blue', 'green', 'green', 'red']}

Numeric format after One-Hot encoding:

   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
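
The table above can be reproduced with pandas. This is a minimal sketch, assuming pandas is installed; `pd.get_dummies` is only one common way to one-hot encode, and the names `df` and `encoded` are illustrative rather than taken from the question.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'green', 'red']}
df = pd.DataFrame(data)

# One indicator column per distinct color; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=['color'], dtype=int)
print(encoded)
```

Each row has exactly one 1, so the encoding says nothing about any order between the colors.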

Ordinal data converted to numeric:

data = {'con': ['old', 'new', 'new', 'renovated']}

Numeric format after using mapping: Old < renovated < new → 0, 1, 2

0    0
1    2
2    2
3    1
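
A similar sketch for the ordinal mapping, again assuming pandas. The dictionary `order` is a hypothetical name; the values 0, 1, 2 are the ones given in the question.

```python
import pandas as pd

data = {'con': ['old', 'new', 'new', 'renovated']}
df = pd.DataFrame(data)

# Explicit mapping that preserves the order old < renovated < new.
order = {'old': 0, 'renovated': 1, 'new': 2}
encoded = df['con'].map(order)
print(encoded)
```

Writing the mapping by hand keeps the order under your control; an automatic label encoder would typically assign codes alphabetically, which here would not match old < renovated < new.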

In my data I have a 'color' feature. As the color changes from white to black, the price increases. According to the rules above, I should probably use one-hot encoding for the categorical 'color' data. But why can't I use an ordinal representation? Below are the observations that led to my question.

Let me start by introducing the formula for linear regression: price = θ0 + θ1·x1 + θ2·x2 + … + θn·xn
Let's look at the two data representations for color and use the formula to predict the price of the 1st and 2nd items with each of them:
One-hot encoding: in this case each color has its own theta. I assume the thetas have already been derived from the regression (20, 50 and 100). The predictions will be:

Price (1 item) = 0 + 20*1 + 50*0 + 100*0 = 20$  (thetas are assumed for example)
Price (2 item) = 0 + 20*0 + 50*1 + 100*0 = 50$  

Ordinal encoding for color: in this case all colors share one common theta, but the multipliers I assigned to them (10, 20, 30) differ:

Price (1 item) = 0 + 20*10 = 200$  (theta assumed for example)
Price (2 item) = 0 + 20*20 = 400$  (theta assumed for example)
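
The two calculations above can be written out for all three colors. This is only a sketch using the numbers assumed in the question (intercept 0, one-hot thetas 20/50/100, ordinal theta 20 with codes 10/20/30); the function names are illustrative.

```python
# Values assumed in the question, not fitted from data.
theta0 = 0
onehot_thetas = [20, 50, 100]
ordinal_theta = 20
ordinal_codes = [10, 20, 30]

def predict_onehot(index):
    """Price for the color at position `index` under one-hot encoding."""
    x = [1 if i == index else 0 for i in range(len(onehot_thetas))]
    return theta0 + sum(t * xi for t, xi in zip(onehot_thetas, x))

def predict_ordinal(index):
    """Price for the same color under the single-theta ordinal encoding."""
    return theta0 + ordinal_theta * ordinal_codes[index]

onehot_prices = [predict_onehot(i) for i in range(3)]
ordinal_prices = [predict_ordinal(i) for i in range(3)]
print(onehot_prices)   # [20, 50, 100]
print(ordinal_prices)  # [200, 400, 600]
```

With one-hot encoding the gaps between colors (30 and 50) are free parameters. With the ordinal encoding the gaps are forced to follow the spacing of the codes, so here they are both 200.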

In my model, prices are ordered White < Red < Black. The correlation seems to work correctly and the predictions are logical in both cases, for both the ordinal and the categorical representation. So can I use any encoding for my regression regardless of the data type (categorical or ordinal)? Is this division in data representations just a matter of convention and software-oriented representation rather than a matter of regression logic itself?

Recommended answer

So can I use any encoding for my regression regardless of the data type (categorical or ordinal)? Is this division in data representations just a matter of convention and software-oriented representation rather than a matter of regression logic itself?

You can do anything. The question is: what will probably work better? The answer is that you should use a representation which embeds correct information about the structure of the data and does not embed false assumptions. What does that mean here?

  • If your data is categorical and you use a numeric format, you embed false structure (as there is no ordering of categorical data).
  • If your data is ordinal and you use one-hot encoding, you do not embed the true structure (as there is an ordering and you ignore it).
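
The first bullet can be illustrated with a tiny sketch. The prices and the code assignment below are invented for illustration: the colors have no natural order, yet the numeric code imposes red < white < black. The one-hot fit uses the fact that least squares on indicator columns (without an intercept) reduces to one mean per category.

```python
# Hypothetical prices, invented for illustration.
samples = [('red', 50.0), ('white', 20.0), ('black', 100.0)]
arbitrary_code = {'red': 0, 'white': 1, 'black': 2}  # imposes a false order

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

xs = [arbitrary_code[c] for c, _ in samples]
ys = [p for _, p in samples]
a, b = fit_line(xs, ys)
numeric_errors = [abs(a + b * x - y) for x, y in zip(xs, ys)]

# One-hot least squares: one theta per color, equal to that color's mean price.
by_color = {}
for c, p in samples:
    by_color.setdefault(c, []).append(p)
onehot_theta = {c: sum(v) / len(v) for c, v in by_color.items()}
onehot_errors = [abs(onehot_theta[c] - p) for c, p in samples]

print(numeric_errors)
print(onehot_errors)
```

The single-theta line cannot pass through prices that go down and then up along the imposed order, so every sample is mispredicted, while the one-hot model reproduces them exactly.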

So why do both formats "work" in your case? Because your problem is trivial and, in fact, incorrectly stated. You analyze how well the training samples are predicted, and given a sufficiently overfitting model you will always get a perfect score on the training data, no matter what the representation is. What you have actually done is show that there exists a theta which makes things right. And yes, if there exists a theta (in linear models) which works for the ordinal representation, there will always be one for the one-hot representation. The thing is, you will be much more likely to miss it while training your model. It is not a software-oriented problem, it is a learning-oriented problem.
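
The claim that an ordinal solution always has a one-hot counterpart can be checked directly. This sketch reuses the question's numbers (theta 20, codes 10/20/30); which color gets which code is an assumption based on the White < Red < Black ordering.

```python
# Ordinal model from the question: price = theta0 + theta * code.
theta0, theta = 0, 20
codes = {'white': 10, 'red': 20, 'black': 30}  # color-to-code assignment is assumed

# Equivalent one-hot model: one theta per color, same intercept.
onehot_thetas = {color: theta * code for color, code in codes.items()}

def predict_ordinal(color):
    return theta0 + theta * codes[color]

def predict_onehot(color):
    # Only the indicator of `color` is 1, so only its theta contributes.
    return theta0 + sum(t * (1 if c == color else 0)
                        for c, t in onehot_thetas.items())

for color in codes:
    assert predict_ordinal(color) == predict_onehot(color)
print(onehot_thetas)  # {'white': 200, 'red': 400, 'black': 600}
```

The converse does not hold in general: the one-hot thetas 20, 50, 100 from the question are not evenly spaced, so no single theta (even with an intercept) applied to the codes 10, 20, 30 reproduces them.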

In practice, however, it would not happen. Once you introduce an actual problem, with lots of data which might be noisy, uncertain, etc., you would get better scores with less effort using a representation which reflects the nature of the problem (here: ordinal) than using a representation which does not (here: one-hot). Why? Because the knowledge that the feature is ordinal can be inferred (learned) from the data by the model, but you will need much more training data to do so. So why do this if you can embed this information directly into the data structure, leading to an easier learning problem? Learning in ML is actually hard; do not make it even harder. On the other hand, always remember that you have to be sure that the knowledge you embed is indeed true, because it might be hard to learn a relation from the data, but it is even harder to learn real patterns from false relations.
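
The point about noisy data can be probed with a small simulation. Everything here is a hypothetical setup, not from the answer: an ordinal feature with 8 levels, a price that truly grows linearly with the level, Gaussian noise, and only two training samples per level. The ordinal model fits 2 parameters, the one-hot model fits 8 (one mean per level).

```python
import random

random.seed(0)
LEVELS = list(range(8))          # hypothetical ordinal feature with 8 levels
TRUE_SLOPE, NOISE = 10.0, 15.0   # invented for illustration

def sample(n_per_level):
    """Noisy prices that really do grow linearly with the level."""
    return [(k, TRUE_SLOPE * k + random.gauss(0, NOISE))
            for k in LEVELS for _ in range(n_per_level)]

def fit_ordinal(data):
    """Least squares for price = a + b * level (2 parameters)."""
    n = len(data)
    mx = sum(k for k, _ in data) / n
    my = sum(y for _, y in data) / n
    b = (sum((k - mx) * (y - my) for k, y in data)
         / sum((k - mx) ** 2 for k, _ in data))
    a = my - b * mx
    return lambda k: a + b * k

def fit_onehot(data):
    """Least squares on one-hot columns = mean price per level (8 parameters)."""
    means = {}
    for k in LEVELS:
        ys = [y for j, y in data if j == k]
        means[k] = sum(ys) / len(ys)
    return lambda k: means[k]

def mse(model, data):
    return sum((model(k) - y) ** 2 for k, y in data) / len(data)

train, test = sample(2), sample(200)
ordinal, onehot = fit_ordinal(train), fit_onehot(train)
train_mse = {'ordinal': mse(ordinal, train), 'onehot': mse(onehot, train)}
test_mse = {'ordinal': mse(ordinal, test), 'onehot': mse(onehot, test)}
print(train_mse)
print(test_mse)
```

On the training set the one-hot fit can never be worse, because the ordinal model is a special case of it. The held-out error is what matters, and its outcome depends on the seed, the noise level and the amount of data; when the ordering assumption is true and training data is scarce, the ordinal model would usually be expected to generalize better, which is the answer's argument.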
