Conditional random matching from one data frame into another data frame

Question

I have two data frames. One data frame (Partners.Missing) contains 195 people who are partnered (married, de facto, etc) for which I need to construct the partner, using a random selection from a second data frame (NAsOnly).

The information on the Partners.Missing data frame is:

 str(Partners.Missing)
 'data.frame':  195 obs. of  8 variables:
  $ V1         : Factor w/ 2 levels "Female","Male": 1 1 1 2 1 1 1 2 2 2 ...
  $ V2         : Factor w/ 9 levels "15 - 17 Years",..: 4 4 7 7 4 4 7 3 7 4 ...
  $ V3         : Factor w/ 1 level "Partnered": 1 1 1 1 1 1 1 1 1 1 ...
  $ V4         : Factor w/ 7 levels "Eight or More Usual Residents",..: 1 1 5 2 1 1 1 1 2 5 ...
  $ V5         : Factor w/ 8 levels "1-9 Hours Worked",..: 8 4 8 6 7 8 7 5 4 6 ...
  $ SEX        : chr  "Male" "Male" "Male" "Female" ...
  $ Ageband    : num  4 4 7 7 4 4 7 3 7 4 ...
  $ Inhabitants: num  8 8 6 5 8 8 8 8 5 6 ...

Because V2 stores the age band as a factor, I created the Ageband variable as a numeric recode of V2, so that the youngest age group (15 - 17 years) is 1, the next oldest is 2, and so on. Inhabitants is a recode of V4, again to construct a numeric variable. SEX is binary ("Male"/"Female").
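The recode described above could look something like the sketch below. Only "15 - 17 Years" and "Eight or More Usual Residents" appear in the question; every other label here is invented for illustration. The age-band part assumes the factor levels are stored in age order, which is worth checking with levels() before relying on it.

```r
# Hypothetical age-band factor: only "15 - 17 Years" is from the question,
# the other labels are invented for this sketch.
age_levels <- c("15 - 17 Years", "18 - 24 Years", "25 - 34 Years")
V2 <- factor(c("25 - 34 Years", "15 - 17 Years", "18 - 24 Years"),
             levels = age_levels)

# With the levels in age order, the integer codes are the band numbers:
# youngest band = 1, next oldest = 2, and so on.
Ageband <- as.numeric(V2)

# Household size needs an explicit lookup because the labels are words.
# "Eight or More Usual Residents" is from the question; the rest are assumed.
size_lookup <- c("Five Usual Residents" = 5,
                 "Six Usual Residents" = 6,
                 "Eight or More Usual Residents" = 8)
V4 <- factor(c("Eight or More Usual Residents", "Five Usual Residents"))
Inhabitants <- unname(size_lookup[as.character(V4)])
```

A named lookup vector keeps the mapping visible in one place, which makes it easier to apply the same recode to both data frames.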

The information on the second data frame (NAsOnly) is:

 str(NAsOnly)
 'data.frame':  762 obs. of  7 variables:
  $ SEX         : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
  $ AGEBAND     : Factor w/ 13 levels "0 - 4 Years",..: 3 3 3 3 3 3 3 3 3 3 ...
  $ RELATIONSHIP: Factor w/ 4 levels "Non-partnered",..: 3 3 3 3 1 1 1 1 1 1 ...
  $ INHABITANTS : Factor w/ 9 levels "Eight or More Usual Residents",..: 7 7 3 2 9 9 9 9 7 7 ...
  $ HRSWORKED   : Factor w/ 9 levels "1-9 Hours Worked",..: 1 8 6 3 1 2 3 6 3 4 ...

I can create new variables so that Ageband and Inhabitants in NAsOnly are the same construction, to use in matching. But I'm stuck on how to match. What I want to do - for each row in Partners.Missing - is to randomly sample an observation from NAsOnly using the following criteria:

  • opposite SEX (so a "Female" in Partners.Missing will match to a "Male" in NAsOnly)
  • the "Female" partner (irrespective of which data frame they originate from) is in the same age band as, or one age band younger than, the "Male" partner
  • the number of Inhabitants is an exact match, so that a "Female" from a 5-person household will only match to a "Male" (of the correct age band) from a 5-person household
  • RELATIONSHIP in NAsOnly can only be "Partnered" ("Non-partnered" and "Not elsewhere included" are also valid variable entries in that data frame)*.
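The four criteria above can be sketched as a single filter followed by a random draw. This is a minimal illustration, not a tested solution for the real data: the toy values are invented, candidates_for is a hypothetical helper name, the numeric Ageband and Inhabitants columns are assumed to already exist in both data frames, and person$SEX is assumed to hold the sex of the partner being sought (the str() output suggests it is the opposite of V1).

```r
# Toy data: column names follow the question, values are invented.
person <- data.frame(V1 = "Female", SEX = "Male", Ageband = 4, Inhabitants = 5,
                     stringsAsFactors = FALSE)
pool <- data.frame(
  SEX          = c("Male", "Male", "Male", "Female", "Male", "Male"),
  Ageband      = c(4, 5, 6, 4, 4, 5),
  Inhabitants  = c(5, 5, 5, 5, 6, 5),
  RELATIONSHIP = c("Partnered", "Partnered", "Partnered", "Partnered",
                   "Partnered", "Non-partnered"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows of `pool` that satisfy all four criteria.
# Assumes person$SEX holds the sex of the partner being sought.
candidates_for <- function(person, pool) {
  # The female partner is in the same band as the male, or one band younger,
  # so a female seeks a male 0 or 1 bands older and a male seeks the reverse.
  offset <- if (person$V1 == "Female") c(0, 1) else c(0, -1)
  pool[pool$SEX == person$SEX &
         pool$Inhabitants == person$Inhabitants &
         pool$Ageband %in% (person$Ageband + offset) &
         pool$RELATIONSHIP == "Partnered", , drop = FALSE]
}

cands <- candidates_for(person, pool)

# Random draw of one candidate rather than the first available.
pick <- cands[sample.int(nrow(cands), 1), ]
```

Using sample.int() on the row count avoids the surprise in sample(), which treats a single number n as the range 1:n.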

So I want a one-to-one match, and I need the match to be a random draw and not the first available. And do this 195 times, once for each observation in Partners.Missing so that their partner is no longer missing.

I can't use the first or last match either, as there could be numerous rows in NAsOnly that match on the basis of my criteria. It has to be a random draw, otherwise the same observations will be drawn from NAsOnly every time. Basically, something like random sampling with replacement from NAsOnly. It does not matter whether the sampled observations are used to construct a third data frame of matches, or whether the sampled observations are added to Partners.Missing as additional columns.

*It has four levels as the original larger data frame had Totals rows, so the fourth (and unused) level is "Total".

Update: I have tried to write a for loop to do this, but it's not working as intended. The code is:

 for(i in 1:1) {
   row <- Partners.Missing[i,]
   if(row$V1=="Female")
   matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
             row$Inhabitants[i]==Partnered.Censored$Inhabitants &
           (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband+1)
   )
   else
   matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
           row$Inhabitants[i]==Partnered.Censored$Inhabitants &
           (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband-1)
   )
 }

This outputs a single column into a data frame called matched with TRUE or FALSE as the input in a single column of 277 rows, representing whether that row's index in Partnered.Censored is a match or not. Once I increase i's maximum value to 2 (knowing I have 195 rows), I get NA as output. I have the following problems remaining:

  • I wish to use the row(s) that matches from Partnered.Censored rather than outputting a boolean result
  • I then wish to sample randomly from the matching rows to generate the new partner
  • and then repeat for each row in Partners.Missing.

I also have the problem where increasing the maximum value of i, e.g. to 2, overwrites the single column of TRUE/FALSE values with NA.
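One likely cause of the NA values, offered as a reading of the loop above rather than a confirmed diagnosis: row already holds a single observation, so row$SEX[i] asks for element i of a length-one vector, and that is NA for any i above 1. The sketch below reproduces this on invented toy data and shows how the logical vector can select the matching rows themselves instead of being stored as a column.

```r
# Toy stand-ins for the question's data frames; values are invented.
Partners.Missing <- data.frame(SEX = c("Male", "Female"),
                               Inhabitants = c(5, 6),
                               stringsAsFactors = FALSE)
Partnered.Censored <- data.frame(SEX = c("Male", "Female", "Female"),
                                 Inhabitants = c(5, 6, 6),
                                 stringsAsFactors = FALSE)

i <- 2
row <- Partners.Missing[i, ]   # a one-row data frame

row$SEX[i]   # NA: element 2 of a length-1 vector
row$SEX      # "Female": the value that was intended

# Comparing NA with anything gives NA, hence a whole column of NA.
bad_mask <- row$SEX[i] == Partnered.Censored$SEX

# Dropping the extra [i] gives a usable TRUE/FALSE vector...
mask <- row$SEX == Partnered.Censored$SEX &
  row$Inhabitants == Partnered.Censored$Inhabitants

# ...which can index the data frame to return the matching rows.
matches <- Partnered.Censored[mask, , drop = FALSE]
```

Note also that matched is reassigned on every pass of the loop, so even with the indexing fixed only the last person's result would survive unless each result is stored or appended.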

Recommended answer

This has been top of my mind for the past couple of days, and I appear to have solved the problem using the following code. I'm leaving the question and answer up just in case anyone else needs to do this.

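 for (i in seq_len(nrow(Partners.Missing))) {
   # One-row data frame for the person who needs a partner
   row <- Partners.Missing[i, ]
   # One-to-many join on the two exact-match variables
   result <- merge(row, Partnered.Censored, by = c("SEX", "Inhabitants"), suffixes = c(".r", ".c"))
   # Keep only candidates in the permitted age band (.r = this person, .c = candidate)
   if (row$V1 == "Female") {
     result <- subset(result, Ageband.r == Ageband.c | Ageband.r == Ageband.c - 1)
   }
   if (row$V1 == "Male") {
     result <- subset(result, Ageband.r == Ageband.c | Ageband.r == Ageband.c + 1)
   }
   # Draw one candidate at random; seq_len() makes sample() stop with an error
   # when there are no candidates, whereas 1:nrow(result) would become 1:0
   j <- sample(seq_len(nrow(result)), 1)
   # The first pass creates the output data frame, later passes append to it
   if (i == 1) {
     Matched.Partners <- result[j, ]
   }
   if (i > 1) {
     Matched.Partners <- rbind(Matched.Partners, result[j, ])
   }
 }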

To explain this code for anyone else who needs it, and to see whether the community has a better answer: for each person in Partners.Missing, a temporary one-row data frame is created holding that person's information. A one-to-many join is then constructed on the two variables that must match exactly: the missing person's sex and the number of inhabitants in the household. Next, depending on whether the person in Partners.Missing is female or male, the joined results are retained only for potential partners in the correct age band. The code then counts the potential partners identified and generates a random integer between 1 and that number. This integer is used to extract the randomly matched person and put them into the output data frame. Because the output data frame (Matched.Partners) does not exist before this code is run, the first pass through the loop creates it with its first row. On every later pass the data frame already exists, so the new match is appended.
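The loop in this answer draws with replacement, so the same person in Partnered.Censored can be matched to more than one row of Partners.Missing. If a strict one-to-one match is wanted, one option is to remove each drawn candidate from the pool. The sketch below is an assumption-laden variant, not the answer's method: the toy values are invented, pool and picks are hypothetical names, and people with no remaining candidate are simply skipped. It also collects results in a list and binds once at the end instead of growing a data frame with rbind on each pass.

```r
set.seed(1)  # only to make the sketch reproducible

# Toy stand-ins: column names follow the answer, values are invented.
Partners.Missing <- data.frame(V1 = c("Female", "Female", "Male"),
                               SEX = c("Male", "Male", "Female"),
                               Ageband = c(4, 4, 5),
                               Inhabitants = c(5, 5, 5),
                               stringsAsFactors = FALSE)
Partnered.Censored <- data.frame(SEX = c("Male", "Male", "Female", "Male"),
                                 Ageband = c(4, 5, 5, 7),
                                 Inhabitants = c(5, 5, 5, 5),
                                 stringsAsFactors = FALSE)

pool <- Partnered.Censored                       # shrinks as partners are drawn
picks <- vector("list", nrow(Partners.Missing))  # one slot per person

for (i in seq_len(nrow(Partners.Missing))) {
  person <- Partners.Missing[i, ]
  # Female seeks a male 0 or 1 bands older; male seeks the reverse.
  offset <- if (person$V1 == "Female") c(0, 1) else c(0, -1)
  # Positions in the remaining pool that meet every criterion.
  idx <- which(pool$SEX == person$SEX &
                 pool$Inhabitants == person$Inhabitants &
                 pool$Ageband %in% (person$Ageband + offset))
  if (length(idx) == 0) next             # no candidate left for this person
  j <- idx[sample.int(length(idx), 1)]   # safe even when length(idx) is 1
  picks[[i]] <- cbind(person, partner = pool[j, ])
  pool <- pool[-j, ]                     # drawn partner cannot be reused
}

# Drop the empty slots of unmatched people, then bind once.
Matched.Partners <- do.call(rbind, Filter(Negate(is.null), picks))
```

With this approach the order in which people are processed can affect who ends up unmatched, so shuffling the rows of Partners.Missing first may be worth considering.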

I'll not vote up either my question or my answer.
