How to apply linear regression of sklearn for some string variable


Problem description

I am going to predict the box office of a movie using logistic regression. I got some training data including the actors and directors. This is my data:

Director1|Actor1|300 million
Director2|Actor2|500 million

I am going to encode the directors and actors using integers.

1|1|300 million
2|2|500 million

Which means that X = {[1,1], [2,2]}, y = [300, 500], and then fit(X, y). Does that work?

Answer

You cannot use categorical variables in linear regression like that. Linear regression treats all variables as numerical. Therefore, if you code Director1 as 1 and Director2 as 2, linear regression will try to find a relationship based on that coding scheme: it will assume Director2 is twice the size of Director1. In reality, those numbers don't mean anything. You could code them as 143 and 9879 and there should be no difference; they have no numerical meaning. To make sure linear regression treats them correctly, you need to use dummy variables.

With dummy variables, you have a variable for every category level. For example, if you have 3 directors, you will have 3 variables: D1, D2 and D3. D1 will have the value 1 if the corresponding movie was directed by Director1, and 0 otherwise; D2 will have the value 1 if the movie was directed by Director2, and 0 otherwise, and so on. So for the sequence of values D2 D1 D2 D3 D1 D2, your dummy variables will be:

    D1 D2 D3
D2  0  1  0
D1  1  0  0
D2  0  1  0
D3  0  0  1
D1  1  0  0
D2  0  1  0

In linear regression, in order to avoid multicollinearity we use only n-1 of these variables, where n is the number of categories (the number of directors in this example). One of the directors will be selected as the base, and will be represented by the constant in the regression model. It doesn't matter which one. For example, if you exclude D3, you will know the movie was directed by Director3 if D1=0 and D2=0; you don't need to specify D3=1.
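The n-1 dummy coding above can be sketched with pandas (an assumption here; the question itself only mentions scikit-learn). `pd.get_dummies` builds one 0/1 column per level, and `drop_first=True` drops the base category:

```python
import pandas as pd

# The director sequence from the example: D2 D1 D2 D3 D1 D2
directors = pd.Series(["D2", "D1", "D2", "D3", "D1", "D2"])

# Full dummy coding: one 0/1 column per director (D1, D2, D3)
full = pd.get_dummies(directors)

# n-1 coding: drop the first level, so D1 becomes the base category
# represented by the regression constant
reduced = pd.get_dummies(directors, drop_first=True)
print(reduced)
```

A row of the reduced frame with D2=0 and D3=0 then means "directed by Director1", exactly as described above.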

In scikit-learn, this transformation is done with OneHotEncoder. The example below is from the scikit-learn documentation:

You have three categorical variables: Gender, Region and Browser. Gender has 2 levels: ["male", "female"], Region has 3 levels: ["from Europe", "from US", "from Asia"] and Browser has 4 levels: ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Assume they are coded with zero-based numbers, so [0, 1, 2] means a male from the US who uses Safari.

>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

With enc.fit, scikit-learn infers the number of levels for each variable. For an observation like [0, 1, 3], calling enc.transform gives you its dummy variables. Note that the resulting array's length is 2 + 3 + 4 = 9: the first two columns are for gender (if male, the first one is 1), the next three are for region, and so on.
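Tying this back to the original question: newer scikit-learn versions let OneHotEncoder consume the string labels directly, so the intermediate integer coding isn't needed at all. A minimal sketch (the director/actor names and box-office numbers are illustrative, taken from the question's format):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Illustrative training data: [director, actor] -> box office (millions)
X_raw = [["Director1", "Actor1"],
         ["Director2", "Actor2"],
         ["Director1", "Actor2"]]
y = [300, 500, 400]

# String categories are encoded directly into one 0/1 column per level;
# unseen levels at predict time become all-zero rather than raising
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_raw)  # sparse matrix: 2 director + 2 actor columns

model = LinearRegression()
model.fit(X, y)

# Predict for a director/actor pairing not seen during training
pred = model.predict(enc.transform([["Director2", "Actor1"]]))
```

Note that OneHotEncoder keeps all n columns per variable; with scikit-learn's regularized models this is usually fine, and recent versions also offer a `drop` parameter if you want the n-1 coding described above.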
