Python Pandas: how to turn a DataFrame with "factors" into a design matrix for linear regression?

Problem description

If memory serves me, in R there is a data type called factor which, when used within a DataFrame, can be automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into:

1 0 0
0 1 0
or
0 0 1

for the purpose of using lower-level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within pandas, but since I have my own customised regression routines, I am really interested in the construction of the design matrix (a 2D numpy array or matrix) from heterogeneous data, with support for mapping back and forth between the columns of the numpy object and the pandas DataFrame from which it is derived.

Update: Here is an example of a data matrix with heterogeneous data of the sort I am thinking of (the example comes from the Pandas manual):

>>> df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c' : np.random.randn(7)})
>>> df2
       a  b         c
0    one  x  0.000343
1    one  y -0.055651
2    two  y  0.249194
3  three  x -1.486462
4    two  y -0.406930
5    one  x -0.223973
6    six  x -0.189001
>>> 

The 'a' column should be converted into 4 floating point columns (in spite of the meaning, there are only four unique atoms), the 'b' column can be converted to a single floating point column, and the 'c' column should be an unmodified final column in the design matrix.

Thanks,

SetJmp
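
For concreteness, the layout requested above can be sketched by hand. This is only an illustration of the target matrix, not part of the original question: it assumes `pandas.get_dummies` for the indicator columns, and the names `a_cols`, `b_col`, `design` and `columns` are hypothetical.

```python
import numpy as np
import pandas as pd

# Same example frame as in the question
df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": np.random.randn(7),
})

# One float indicator column per unique value of 'a' (four in total)
a_cols = pd.get_dummies(df2["a"], prefix="a", dtype=float)

# 'b' has two values, so a single 0/1 float column is enough
b_col = (df2["b"] == "y").astype(float).rename("b_y")

# 'c' goes in unmodified as the final column
design = pd.concat([a_cols, b_col, df2["c"]], axis=1)

X = design.to_numpy()           # 2D float array for custom regression code
columns = list(design.columns)  # position in this list == column index in X
print(columns)
```

Keeping `columns` alongside `X` gives a simple way to map a column of the array back to the DataFrame column it came from, although the bookkeeping has to be maintained manually.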

Solution

There is a new module called patsy that solves this problem. Its quickstart guide solves exactly the problem described above in a couple of lines of code.

Here is an example usage:

import pandas
import patsy

# salary2.txt is a re-formatted data set from the textbook
# "Introductory Econometrics: A Modern Approach"
# by Jeffrey Wooldridge
dataFrame = pandas.read_csv("salary2.txt")

# Build the response vector y and the design matrix X from a formula
y, X = patsy.dmatrices("sl ~ 1 + sx + rk + yr + dg + yd", dataFrame)

# X.design_info provides the metadata behind the X columns
print(X.design_info)

generates:

> DesignInfo(['Intercept',
>             'sx[T.male]',
>             'rk[T.associate]',
>             'rk[T.full]',
>             'dg[T.masters]',
>             'yr',
>             'yd'],
>            term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('sx')]), slice(1, 2, None)),
> (Term([EvalFactor('rk')]), slice(2, 4, None)),
> (Term([EvalFactor('dg')]), slice(4, 5, None)),
> (Term([EvalFactor('yr')]), slice(5, 6, None)),
> (Term([EvalFactor('yd')]), slice(6, 7, None))]),
>            builder=<patsy.build.DesignMatrixBuilder at 0x10f169510>)
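
The same approach can be applied to the `df2` frame from the question. The following is a minimal sketch, not taken from the original answer: it assumes a formula without an intercept (`0 + ...`) so that `a` expands into one column per level, as the question asks, and the variable names `names`, `a_slice` and `X_array` are hypothetical.

```python
import numpy as np
import pandas as pd
import patsy

# Same example frame as in the question
df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": np.random.randn(7),
})

# "0 +" removes the intercept, so the first categorical term ('a')
# is coded with one indicator column per level
X = patsy.dmatrix("0 + a + b + c", df2)

# Metadata for mapping between DataFrame columns and matrix columns
names = X.design_info.column_names           # one name per matrix column
a_slice = X.design_info.term_name_slices["a"]  # columns produced by term 'a'

# Plain 2D float array for use with custom regression routines
X_array = np.asarray(X)

print(names)
print(a_slice)
print(X_array.shape)
```

With an intercept in the formula, patsy would instead drop one level of `a` as the reference category, which is usually what a regression needs to avoid collinear columns.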
