Simple way to do a weighted hot deck imputation in Stata?

Question

I'd like to do a simple weighted hot deck imputation in Stata. In SAS the equivalent command would be the following (and note that this is a newer SAS feature, beginning with SAS/STAT 14.1 in 2015 or so):

proc surveyimpute method=hotdeck(selection=weighted); 

For clarity then, the basic requirements are:

  1. Imputations must be row-based, or simultaneous. If row 1 donates x to row 3, then it must also donate y.

  2. Must account for weights. A donor with weight=2 should be twice as likely to be selected as a donor with weight=1.

I'm assuming the missing data is rectangular. In other words, if the set of potentially missing variables consists of x and y then either both are missing or neither is missing. Here's some code to generate sample data.

global miss_vars "wealth income"
global weight    "weight"

set obs 6
gen id = _n
gen type = id > 3
gen income = 5000 * _n
gen wealth = income * 4 + 500 * uniform()
gen weight = 1
replace weight = 4 if mod(id-1,3) == 0

// set income & wealth missing every 3 rows
gen impute = mod(_n,3) == 0
foreach v in $miss_vars {
    replace `v' = . if impute == 1
}

The data looks like this:

            id       type     income     wealth     weight     impute
  1.         1          0       5000   20188.03          4          0
  2.         2          0      10000   40288.81          1          0
  3.         3          0          .          .          1          1
  4.         4          1      20000   80350.85          4          0
  5.         5          1      25000   100378.8          1          0
  6.         6          1          .          .          1          1

So in other words, for each row with missing values we need to randomly (with weighting) select a donor observation of the same type and use that donor to fill in both the income and wealth values. In practical use, generating the type variable is of course its own problem, but I'm keeping that very simple here to focus on the main issue.

For example, after the hot deck, row 3 might look like either of the following, because it fills both income and wealth from row 1, or both from row 2. It would never take income from row 1 and wealth from row 2:

  3.         3          0       5000   20188.03          1          1
  3.         3          0      10000   40288.81          1          1

Also, since row 1 has weight=4 and row 2 has weight=1, row 1 should be the donor 80% of the time and row 2 should be the donor 20% of the time.
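
The 80%/20% split follows from selecting donors with probability proportional to weight within each type. As a short worked statement (the symbols $w_i$ for a donor's weight and $D_c$ for the set of donors in imputation cell $c$ are notation introduced here for illustration, not taken from the SAS or Stata documentation):

```latex
P(\text{donor} = i \mid \text{cell } c) = \frac{w_i}{\sum_{j \in D_c} w_j},
\qquad
P(\text{row 1}) = \frac{4}{4 + 1} = 0.8,
\qquad
P(\text{row 2}) = \frac{1}{4 + 1} = 0.2 .
```

The same calculation applies to row 6, whose candidate donors are rows 4 and 5 with weights 4 and 1.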

Recommended answer

It appears there was no way to do this in Stata, nor were there community-contributed commands for it. There were community-contributed commands that did hot decks (specifically hotdeck, whotdeck, and hotdeckvar), but none of them handled sample weights. The whotdeck command superficially appeared to handle weights, but those were internally estimated importance weights rather than sample weights.

Hence I wrote a program myself and uploaded it to GitHub. It is called wtd_hotdeck. Please follow that link for more information and any subsequent updates.
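
For readers who want to see the idea in plain Stata, here is a minimal sketch of a weighted, row-based hot deck on the sample data from the question. It is not the wtd_hotdeck program and makes no claim about how that program works internally. The variables cumw, totw, u and donor_id, and the seed value, are hypothetical names chosen for this illustration, and the double loop is only sensible for small data sets.

```stata
* Illustrative sketch only; NOT the wtd_hotdeck implementation.
clear
set seed 12345                       // arbitrary seed, for reproducibility
set obs 6
gen id     = _n
gen type   = id > 3
gen income = 5000 * _n
gen wealth = income * 4 + 500 * runiform()
gen weight = 1
replace weight = 4 if mod(id - 1, 3) == 0
gen impute = mod(_n, 3) == 0
foreach v of varlist income wealth {
    replace `v' = . if impute == 1
}

* Cumulative donor weight within each imputation cell (type).
sort type impute id                  // donors (impute == 0) sort first
by type: gen double cumw = sum(weight * (impute == 0))
by type: gen double totw = cumw[_N]  // total donor weight in the cell

* One uniform draw per recipient, scaled to the cell's total donor weight.
gen double u = runiform() * totw if impute == 1

* For each recipient, pick the first donor in the same cell whose cumulative
* weight reaches u, then copy every variable from that one donor row.
gen long donor_id = .
local N = _N
forvalues i = 1/`N' {
    if impute[`i'] == 1 {
        local d = .
        forvalues j = 1/`N' {
            if `d' == . & type[`j'] == type[`i'] & impute[`j'] == 0 & cumw[`j'] >= u[`i'] {
                local d = `j'
            }
        }
        foreach v of varlist income wealth {
            quietly replace `v' = `v'[`d'] in `i'
        }
        quietly replace donor_id = id[`d'] in `i'
    }
}
sort id
list id type income wealth weight impute donor_id
```

Because the draw u is uniform on the cell's total donor weight, a donor with weight 4 covers four fifths of that interval and a donor with weight 1 covers one fifth, which gives the 80%/20% split described in the question. Copying income and wealth from the same donor row keeps the imputation row-based.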
