Difference in Differences in Python + Pandas

Question

I'm trying to perform a Difference in Differences analysis (with panel data and fixed effects) using Python and Pandas. I have no background in economics; I'm just trying to filter the data and run the method I was told to use. As far as I understand, the basic diff-in-diffs model looks like this:
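
The formula that followed here did not survive in this copy. As a stand-in, and only as an assumption, the textbook two-group, two-period specification (which has the same shape as the R call quoted further down) is usually written as follows, where treat_i and post_t are 0/1 indicators and beta_3 is the diff-in-diffs estimate:

```latex
y_{it} = \beta_0 + \beta_1\,\mathrm{post}_t + \beta_2\,\mathrm{treat}_i
       + \beta_3\,(\mathrm{post}_t \times \mathrm{treat}_i) + \varepsilon_{it}
```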

I.e., I am dealing with a multivariable model.

A simple example in R can be found here:

https://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/

As can be seen, the regression takes as input one dependent variable and three sets of observations.

My input data looks like this:

    Name                            Permits_13  Score_13  Permits_14  Score_14  Permits_15  Score_15
 0  P.S. 015 ROBERTO CLEMENTE             12.0       284          22       279          32       283
 1  P.S. 019 ASHER LEVY                   18.0       296          51       301          55       308
 2  P.S. 020 ANNA SILVER                   9.0       294           9       290          10       293
 3  P.S. 034 FRANKLIN D. ROOSEVELT         3.0       294           4       292           1       296
 4  P.S. 064 ROBERT SIMON                  3.0       287          15       288          17       291
 5  P.S. 110 FLORENCE NIGHTINGALE          0.0       313           3       306           4       308
 6  P.S. 134 HENRIETTA SZOLD               4.0       290          12       292          17       288
 7  P.S. 137 JOHN L. BERNSTEIN             4.0       276          12       273          17       274
 8  P.S. 140 NATHAN STRAUS                13.0       282          37       284          59       284
 9  P.S. 142 AMALIA CASTRO                 7.0       290          15       285          25       284
10  P.S. 184M SHUANG WEN                   5.0       327          12       327           9       327

Through some research I found that this is the way to use fixed effects and panel data with Pandas:

Fixed effect in Pandas or Statsmodels

I applied some transformations to get MultiIndex data:

import numpy
import pandas

# One year-end timestamp per year, crossed with the school names
rng = pandas.to_datetime(['2013-12-31', '2014-12-31', '2015-12-31'])
index = pandas.MultiIndex.from_product([rng, df['Name']], names=['date', 'id'])
# Stack the yearly (permits, score) pairs in the same order as the index
d1 = df[['Permits_13', 'Score_13']].to_numpy()
d2 = df[['Permits_14', 'Score_14']].to_numpy()
d3 = df[['Permits_15', 'Score_15']].to_numpy()
data = numpy.concatenate((d1, d2, d3), axis=0)
# x = permits, y = score (the names used in the model call below)
s = pandas.DataFrame(data, index=index, columns=['x', 'y'])
s = s.astype('float')
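
For reference, the same long layout can be built without manual concatenation. The following is a minimal sketch, assuming the column naming shown in the table above (Permits_13, Score_13, ...). The names long, x and y and the 13 -> 2013 conversion are illustrative choices, and only the first three schools are typed in:

```python
import pandas as pd

# First three rows of the table shown in the question
df = pd.DataFrame({
    "Name": ["P.S. 015 ROBERTO CLEMENTE", "P.S. 019 ASHER LEVY", "P.S. 020 ANNA SILVER"],
    "Permits_13": [12.0, 18.0, 9.0], "Score_13": [284, 296, 294],
    "Permits_14": [22, 51, 9],       "Score_14": [279, 301, 290],
    "Permits_15": [32, 55, 10],      "Score_15": [283, 308, 293],
})

# Reshape from one column per (variable, year) to one row per (school, year)
long = (
    pd.wide_to_long(df, stubnames=["Permits", "Score"], i="Name", j="year", sep="_")
    .reset_index()
    .rename(columns={"Name": "id", "Permits": "x", "Score": "y"})
)
long["year"] = long["year"].astype(int) + 2000  # 13 -> 2013
long = long.sort_values(["id", "year"]).reset_index(drop=True)
print(long)
```

Calling long.set_index(["year", "id"]) afterwards gives a MultiIndex comparable to the one built by hand above.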

However, I didn't understand how to pass all these variables to the model, as can be done in R:

reg1 = lm(work ~ post93 + anykids + p93kids.interaction, data = etc)

Here, 13, 14 and 15 denote the data for 2013, 2014 and 2015, which I believe should be used to create a panel. I called the model like this:

reg  = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)

And this is the result:

I was told (by an economist) that this doesn't seem to be running with fixed effects.

--EDIT--

What I want to verify is the effect of the number of permits on the score, given the time. The number of permits is the treatment; it is an intensive treatment.

A sample of the code can be found here: https://www.dropbox.com/sh/ped312ur604357r/AACQGloHDAy8I2C6HITFzjqza?dl=0.

Solution

It seems that what you need is not a difference-in-differences (DD) regression. DD regressions are relevant when you can distinguish a control group from a treatment group. A standard simplified example is the evaluation of a medicine. You split a population of sick people into two groups. Half of them are given nothing: they are the control group. The other half are given the medicine: they are the treatment group. Essentially, the DD regression captures the fact that the real effect of the medicine cannot be read directly from how many of the people who took it got healthy. Intuitively, you want to know whether these people did better than the ones who were not given any medicine. This result could be refined by adding yet another category: a placebo group, i.e. people who are given something that looks like a medicine but actually isn't... but again this would be a well-defined group. Last but not least, for a DD regression to be really appropriate, you need to make sure the groups are not heterogeneous in a way that could bias the results. A bad situation for your medicine test would be a treatment group that includes only people who are young and super fit (hence more likely to heal in general), while the control group is a bunch of old alcoholics...
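
To make the arithmetic concrete, here is a minimal sketch of the two-group, before/after comparison behind a DD estimate. The recovery rates are invented for illustration, and the before/after split is an assumption added to the medicine example:

```python
import pandas as pd

# Hypothetical share of healthy people (%) in each group and period
cells = pd.DataFrame({
    "group":   ["control", "control", "treated", "treated"],
    "period":  ["before", "after", "before", "after"],
    "healthy": [40.0, 50.0, 45.0, 70.0],
})

means = cells.pivot(index="group", columns="period", values="healthy")
change = means["after"] - means["before"]     # change within each group
did = change["treated"] - change["control"]   # difference of the two changes
print(did)  # 15.0
```

The treated group improved by 25 points, but 10 of those points also show up in the control group, so only 15 are attributed to the medicine.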

In your case, if I'm not mistaken, everybody gets "treated" to some extent... so you are closer to a standard regression framework where the impact of X on Y (e.g. IQ on wage) is to be measured. I understand that you want to measure the impact of the number of permits on the score (or is it the other way round? -_-), and you have classical endogeneity to deal with, i.e. if Peter is more skilled than Paul, he'll typically obtain more permits AND a higher score. So what you actually want to exploit is the fact that, with the same level of skill over time, Peter (respectively Paul) will be "given" different numbers of permits over the years... and there you'll really measure the influence of permits on score...
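
A small simulation can illustrate the Peter/Paul problem. This is only a sketch with invented numbers: skill is an unobserved, time-constant trait, true_effect is the assumed causal effect of permits on score, and all variable names are hypothetical. The pooled regression mixes up skill and permits, whereas subtracting each person's own averages keeps only the year-to-year variation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_people, n_years, true_effect = 200, 5, 0.5   # all hypothetical

skill = rng.normal(size=n_people)              # unobserved, constant over time
ids = np.repeat(np.arange(n_people), n_years)
# Skilled people get more permits AND higher scores
permits = 2.0 * skill[ids] + rng.normal(size=ids.size)
score = true_effect * permits + 3.0 * skill[ids] + rng.normal(size=ids.size)
panel = pd.DataFrame({"id": ids, "x": permits, "y": score})

def slope(x, y):
    """OLS slope of y on x in a regression with an intercept."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc * xc).sum())

# Pooled OLS ignores who is who, so skill contaminates the estimate
pooled = slope(panel["x"], panel["y"])

# Within estimator: subtract each person's own mean, which removes skill
x_within = panel["x"] - panel.groupby("id")["x"].transform("mean")
y_within = panel["y"] - panel.groupby("id")["y"].transform("mean")
within = slope(x_within, y_within)

print("pooled: {:.2f}  within: {:.2f}  true: {:.2f}".format(pooled, within, true_effect))
```

Under these assumptions the pooled slope lands far above the true effect, while the within slope stays close to it.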

I might not be guessing well, but I want to insist on the fact that there are many ways to obtain biased, hence meaningless, results if you don't put enough effort into understanding/explaining what's going on in the data. Regarding technical details, your estimation only has year fixed effects (likely not estimated but taken into account through demeaning, hence not returned in the output), so what you want to do is add entity_effects = True. If you want to go further... I'm afraid panel data regressions are not well covered in any Python package so far (including statsmodels, which is the reference for econometrics), so if you're not willing to invest... I would rather suggest using R or Stata. Meanwhile, if a fixed-effects regression is all you need, you can also get it with statsmodels (which also lets you cluster standard errors if needed...):

import statsmodels.formula.api as smf

# turn the (date, id) MultiIndex into ordinary columns for the formula API
df = s.reset_index(drop=False)
# year and entity fixed effects enter as dummy variables
reg = smf.ols('y ~ x + C(date) + C(id)',
              data=df).fit()
print(reg.summary())
# clustering standard errors at individual level
reg_cl = smf.ols(formula='y ~ x + C(date) + C(id)',
                 data=df).fit(cov_type='cluster',
                              cov_kwds={'groups': df['id']})
print(reg_cl.summary())
# output only coeff and standard error of x
print('{:.3f} ({:.3f})'.format(reg.params['x'], reg.bse['x']))
print('{:.3f} ({:.3f})'.format(reg_cl.params['x'], reg_cl.bse['x']))
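
As a rough check on the "demeaning" remark, the coefficient on x from a dummy-variable regression can be reproduced by subtracting entity and date means by hand. The sketch below uses only numpy and pandas on the first four schools from the question (the short ids such as PS015 are abbreviations introduced here) and relies on the panel being balanced:

```python
import numpy as np
import pandas as pd

# Permits (x) and scores (y) for four schools from the question, 2013-2015
df = pd.DataFrame({
    "id":   np.repeat(["PS015", "PS019", "PS020", "PS034"], 3),
    "date": np.tile([2013, 2014, 2015], 4),
    "x": [12, 22, 32, 18, 51, 55, 9, 9, 10, 3, 4, 1],
    "y": [284, 279, 283, 296, 301, 308, 294, 290, 293, 294, 292, 296],
}).astype({"x": float, "y": float})

def two_way_demean(col):
    """Subtract entity and date means, add back the grand mean (balanced panel only)."""
    return (col
            - col.groupby(df["id"]).transform("mean")
            - col.groupby(df["date"]).transform("mean")
            + col.mean())

xd, yd = two_way_demean(df["x"]), two_way_demean(df["y"])
beta_within = float((xd * yd).sum() / (xd ** 2).sum())

# Same coefficient from explicit dummies, mirroring 'y ~ x + C(date) + C(id)'
dummies = pd.get_dummies(df[["id", "date"]].astype(str), drop_first=True)
X = np.column_stack([np.ones(len(df)), df["x"].to_numpy(), dummies.to_numpy(dtype=float)])
coefs = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0]
beta_lsdv = float(coefs[1])

print("{:.4f} {:.4f}".format(beta_within, beta_lsdv))
```

The two numbers agree up to floating-point error, which is why a package can absorb fixed effects by demeaning without reporting a coefficient for each dummy.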

Regarding econometrics, you'll likely get more/better answers on Cross Validated than here.
