Conditional keyed join/update _and_ update a flag column for matches

Question

This is very similar to the question @DavidArenburg asked about conditional keyed joins, with an additional bugbear that I can't seem to suss out.

Basically, in addition to a conditional join, I want to define a flag recording at which step of the matching process the match occurred; my problem is that I can only get the flag set for all rows, not just the matched ones.

Here's what I hope is a minimal working example:

library(data.table)

DT = data.table(
  name = c("Joe", "Joe", "Jim", "Carol", "Joe",
           "Carol", "Ann", "Ann", "Beth", "Joe", "Joe"),
  surname = c("Smith", "Smith", "Jones",
              "Clymer", "Smith", "Klein", "Cotter",
              "Cotter", "Brown", "Smith", "Smith"),
  maiden_name = c("", "", "", "", "", "Clymer",
                  "", "", "", "", ""),
  id = c(1, 1:3, rep(NA, 7)),
  year = rep(1:4, c(4, 3, 2, 2)),
  #both flags start out FALSE, as in the printed table below
  flag1 = FALSE, flag2 = FALSE, key = "year"
)

DT
#      name surname maiden_name id year flag1 flag2
#  1:   Joe   Smith              1    1 FALSE FALSE
#  2:   Joe   Smith              1    1 FALSE FALSE
#  3:   Jim   Jones              2    1 FALSE FALSE
#  4: Carol  Clymer              3    1 FALSE FALSE
#  5:   Joe   Smith             NA    2 FALSE FALSE
#  6: Carol   Klein      Clymer NA    2 FALSE FALSE
#  7:   Ann  Cotter             NA    2 FALSE FALSE
#  8:   Ann  Cotter             NA    3 FALSE FALSE
#  9:  Beth   Brown             NA    3 FALSE FALSE
# 10:   Joe   Smith             NA    4 FALSE FALSE
# 11:   Joe   Smith             NA    4 FALSE FALSE

My approach is, for each year, to first try and match on first name/last name from a prior year; if that fails, then try to match on first name/maiden name. I want to define flag1 to denote an exact match and flag2 to denote a marriage.

for (yr in 2:4) {

  #which ids have we hit so far?
  existing_ids = DT[.(yr), unique(id)]

  #find people in prior years appearing to
  #  correspond to those people
  unmatched = 
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)

  #merge a la Arun, define flag1
  setkey(DT, name, surname)
  DT[year == yr, c("id", "flag1") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)

  #repeat, this time keying on name/maiden_name
  existing_ids = DT[.(yr), unique(id)]
  unmatched = 
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N],by=id]
  setkey(unmatched, name, surname)

  #now define flag2 = TRUE
  setkey(DT, name, maiden_name)
  DT[year==yr & is.na(id), c("id", "flag2") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)

  #this is messy, but I'm trying to increment id
  #  for "new" individuals
  setkey(DT, name, surname, maiden_name)
  DT[year == yr & is.na(id),
     id := unique(
       DT[year == yr & is.na(id)], 
       by = c("name", "surname", "maiden_name")
     )[ , count := .I][.SD, count] + DT[ , max(id, na.rm = TRUE)]
     ]

  #re-sort by year at the end    
  setkey(DT, year)    
}

I was hoping that by including the TRUE value in the j argument while I define id, only the matched names (e.g., Joe at the first step) would have their flag updated to TRUE, but this isn't the case--they are all updated:

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2  TRUE  TRUE
#  6: Carol   Klein      Clymer  3    2  TRUE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3  TRUE  TRUE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE

Is there any way to update only the matched rows' flag values? Ideal output is as follows:

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2 FALSE FALSE
#  6: Carol   Klein      Clymer  3    2 FALSE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3 FALSE FALSE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE


Recommended answer

The key (no pun intended), I think, was to realize that the merge returns NA for the IDs it fails to match, so I should add the flag to unmatched at each step; e.g., at step 1:

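#step 1, inside the loop: carry the flag on unmatched itself, so rows
#  that fail to match get NA from the join instead of a recycled TRUE
unmatched <- DT[.(1:(yr - 1L))
                ][!id %in% existing_ids,
                  .SD[.N], by = id][ , flag1 := TRUE]
DT[year == yr, c("id", "flag1") := 
     unmatched[.SD, .(id, flag1), on = c("name", "surname")]]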

This ultimately produces:

DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2    NA    NA
#  6: Carol   Klein      Clymer  3    2    NA  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3    NA    NA
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
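
To see why carrying the flag on unmatched makes a difference, here is a toy sketch (the tables lookup and query are made up for illustration and are not part of the original data). A constant such as TRUE in j is recycled to every row of the join result, whereas a column of the table being joined comes back as NA wherever no match is found:

```r
library(data.table)

# Hypothetical lookup table: one known person with an id.
lookup = data.table(name = "Joe", surname = "Smith", id = 1)
# Rows to be matched: Joe is in the lookup, Ann is not.
query = data.table(name = c("Joe", "Ann"), surname = c("Smith", "Cotter"))

# A constant in j is recycled to every row of the result,
# including Ann's row, whose id comes back as NA.
recycled = lookup[query, .(id, flag = TRUE), on = c("name", "surname")]

# A flag stored as a column of the lookup is NA wherever the join finds no match.
lookup[ , flag := TRUE]
joined = lookup[query, .(id, flag), on = c("name", "surname")]
```

In recycled both rows are flagged, while in joined only Joe's row is; this mirrors the difference between the loop in the question and the fix above.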

One remaining problem is that some flags that should be FALSE have been reset to NA; it would be nice to be able to set nomatch = FALSE, but I'm not too worried about this side effect. What matters to me is knowing when each flag is TRUE.
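
If the NA values do matter, one possible clean-up (a sketch only, assuming NA never carries a meaning of its own in the flag columns) is to overwrite them with FALSE once the loop has finished. The small DT below is a hypothetical stand-in for the real table:

```r
library(data.table)

# Hypothetical stand-in for DT after the loop: NA marks rows that
# went through a join without finding a match.
DT = data.table(
  name  = c("Ann", "Carol", "Joe"),
  flag1 = c(NA, NA, TRUE),
  flag2 = c(NA, TRUE, FALSE)
)

# Replace NA with FALSE by reference, one flag column at a time.
for (col in c("flag1", "flag2")) {
  set(DT, i = which(is.na(DT[[col]])), j = col, value = FALSE)
}
```

Doing this after the loop rather than inside it leaves the TRUE values assigned at each step untouched.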
