Python Pandas: how to turn a DataFrame with "factors" into a design matrix for linear regression?
If memory serves me, R has a data type called factor which, when used within a DataFrame, can be automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into:
1 0 0
0 1 0
or
0 0 1
for the purpose of using lower-level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within pandas, but since I have my own customised regression routines, I am really interested in constructing the design matrix (a 2D numpy array or matrix) from heterogeneous data, with support for mapping back and forth between the columns of the numpy object and the pandas DataFrame from which it is derived.
Update: Here is an example of a data matrix with heterogeneous data of the sort I am thinking of (the example comes from the Pandas manual):
>>> df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c' : np.random.randn(7)})
>>> df2
a b c
0 one x 0.000343
1 one y -0.055651
2 two y 0.249194
3 three x -1.486462
4 two y -0.406930
5 one x -0.223973
6 six x -0.189001
>>>
The 'a' column should be converted into 4 floating point columns (in spite of the meaning, there are only four unique atoms), the 'b' column can be converted to a single floating point column, and the 'c' column should be an unmodified final column in the design matrix.
Thanks,
SetJmp
There is a new module called patsy that solves this problem. Its quickstart solves exactly the problem described above in a couple of lines of code.
Here is an example usage:
import pandas
import patsy

# salary2.txt is a re-formatted data set from the textbook
# "Introductory Econometrics: A Modern Approach" by Jeffrey Wooldridge
dataFrame = pandas.read_csv("salary2.txt")

# Build the response (y) and the design matrix (X) from an R-style formula
y, X = patsy.dmatrices("sl ~ 1 + sx + rk + yr + dg + yd", dataFrame)

# X.design_info provides the metadata behind the X columns
print(X.design_info)
generates:
> DesignInfo(['Intercept',
> 'sx[T.male]',
> 'rk[T.associate]',
> 'rk[T.full]',
> 'dg[T.masters]',
> 'yr',
> 'yd'],
> term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('sx')]), slice(1, 2, None)),
> (Term([EvalFactor('rk')]), slice(2, 4, None)),
> (Term([EvalFactor('dg')]), slice(4, 5, None)),
> (Term([EvalFactor('yr')]), slice(5, 6, None)),
> (Term([EvalFactor('yd')]), slice(6, 7, None))]),
> builder=<patsy.build.DesignMatrixBuilder at 0x10f169510>)
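For the df2 example from the question, a minimal sketch along the same lines might look like the following. It assumes that dropping the intercept (the "0 +" in the formula) is acceptable, since that is what makes patsy expand 'a' into one indicator column per level (four columns) while 'b' is reduced to a single column and 'c' passes through unchanged. The seeded random generator is only there to make the sketch reproducible; it is not part of the original example.

```python
import numpy as np
import pandas as pd
import patsy

# Same layout as df2 in the question; the seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": rng.standard_normal(7),
})

# "0 +" removes the intercept, so the first factor ('a') gets one column per level.
# 'b' is then coded against its reference level 'x' (a single column), and the
# numeric column 'c' is copied through as the last column.
X = patsy.dmatrix("0 + a + b + c", df2)

# A plain 2D float array for custom regression routines.
X_array = np.asarray(X)

# Mapping between DataFrame terms and design-matrix columns.
column_names = X.design_info.column_names
a_columns = X.design_info.slice("a")   # columns generated from factor 'a'
b_columns = X.design_info.slice("b")   # column generated from factor 'b'
c_columns = X.design_info.slice("c")   # untouched numeric column

print(column_names)
print(a_columns, b_columns, c_columns)
```

The design_info object stays attached to the matrix that dmatrix returns, so the slices can be used to index X_array directly (for example X_array[:, a_columns]) when going back from fitted coefficients to the original DataFrame columns. If a DataFrame is preferred over a bare array, patsy.dmatrix also accepts return_type="dataframe".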