Linear regression analysis with string/categorical features (variables)?


Question

Regression algorithms seem to work on features represented as numbers. For example:

This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict the price.

But now I want to do a regression analysis on data that contains categorical features:

There are 5 features: District, Condition, Material, Security, Type

How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? That is, do I have to create some encoding rules and transform all the data to numeric values according to those rules?

Is there a simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are Python libraries that can be used for that? Is there a risk that the regression model will be somehow incorrect due to "bad encoding"?

Recommended answer

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.

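In general, there are three possibilities:

  1. One-hot encoding for categorical data
  2. Arbitrary numbers for ordinal data
  3. Something like group means for categorical data (e.g. mean prices for city districts)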

You have to be careful not to infuse information that you will not have in the application case.

One-hot encoding

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.

E.g.:

idx color
0   blue
1   green
2   green
3   red

idx blue green red
0   1    0     0
1   0    1     0
2   0    1     0
3   0    0     1

This can easily be done with pandas:

import pandas as pd

data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
# dtype=int gives 0/1 columns; pandas >= 2.0 returns booleans by default
print(pd.get_dummies(data, dtype=int))

This results in:

   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
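
For a linear model with an intercept, one dummy column per category is redundant (the dummy columns always sum to 1), so one level is often dropped and treated as the baseline. A minimal sketch of that idea, assuming a made-up toy data set (the `area`, `color` and `price` values are invented for illustration) and using NumPy's least-squares solver rather than any particular regression library:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: all values are made up for illustration only.
df = pd.DataFrame({
    'area':  [50, 60, 70, 80, 90, 100],
    'color': ['blue', 'green', 'red', 'blue', 'green', 'red'],
    'price': [100, 130, 160, 160, 190, 220],
})

# drop_first=True drops one dummy per feature ('blue' becomes the baseline),
# which avoids perfectly collinear columns once an intercept is added.
X = pd.get_dummies(df[['area', 'color']], drop_first=True, dtype=float)
X.insert(0, 'intercept', 1.0)

# Ordinary least squares fit on the encoded design matrix
y = df['price'].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
coef = pd.Series(coef, index=X.columns)
print(coef)
```

Each dummy coefficient is then read as the price difference relative to the dropped baseline category, holding the other features fixed.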

Numbers for ordinal data

Create a mapping of your sortable categories, e.g. old < renovated < new → 0, 1, 2

This is also possible with pandas:

data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
data['q'] = data['q'].cat.codes
print(data['q'])

Result:

0    0
1    2
2    2
3    1
Name: q, dtype: int8
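
An explicit dictionary is another way to write the same encoding rule. It keeps the rule visible and makes labels that the rule does not cover show up as missing values instead of silently receiving a code. A minimal sketch, assuming the labels from the example above plus a hypothetical extra label `'unknown'`:

```python
import pandas as pd

# Explicit encoding rule: old < ren < new
order = {'old': 0, 'ren': 1, 'new': 2}

# 'unknown' is a hypothetical label that the rule does not cover
data = pd.DataFrame({'q': ['old', 'new', 'ren', 'unknown']})
data['q_code'] = data['q'].map(order)  # labels missing from the dict become NaN

print(data)

# Collect labels that could not be encoded so they can be handled deliberately
unmapped = data.loc[data['q_code'].isna(), 'q'].tolist()
print(unmapped)
```

Note that integer codes imply equal spacing between the levels, which a linear model will take literally.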

Using categorical data for groupby operations

You could use the mean for each category over past (known) events.

Say you have a DataFrame with the last known prices for cities, from which the mean price per city is computed:

prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})

print(data.merge(mean_price, on='city', how='left'))

Result:

  city  price
0    A    1.0
1    B    2.0
2    C    3.0
3    A    1.0
4    B    2.0
5    A    1.0
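
To avoid infusing information that is not available when the model is applied, the group means should come only from known (training) data, and categories that were never seen need some fallback. A minimal sketch, assuming hypothetical `train` and `new` frames and using the overall training mean as the fallback (one possible choice among several):

```python
import pandas as pd

# Known past data (hypothetical values)
train = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
# New rows to encode; city 'D' never occurred in the known data
new = pd.DataFrame({'city': ['A', 'C', 'D']})

city_mean = train.groupby('city')['price'].mean()  # Series indexed by city
overall_mean = train['price'].mean()                # fallback for unseen cities

# Look up each city's mean; unseen cities get the overall mean instead of NaN
new['city_mean_price'] = new['city'].map(city_mean).fillna(overall_mean)
print(new)
```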
