Conditional keyed join/update _and_ update a flag column for matches
Question
This is very similar to the question @DavidArenburg asked about conditional keyed joins, with an additional bugbear that I can't seem to suss out.
Basically, in addition to a conditional join, I want to define a flag recording at which step of the matching process the match occurred; my problem is that I can only get the flag set for every row, not just for the matched rows.
Here's what I hope is a minimal working example:
library(data.table)

DT = data.table(
  name = c("Joe", "Joe", "Jim", "Carol", "Joe",
           "Carol", "Ann", "Ann", "Beth", "Joe", "Joe"),
  surname = c("Smith", "Smith", "Jones",
              "Clymer", "Smith", "Klein", "Cotter",
              "Cotter", "Brown", "Smith", "Smith"),
  maiden_name = c("", "", "", "", "", "Clymer",
                  "", "", "", "", ""),
  id = c(1, 1:3, rep(NA, 7)),
  year = rep(1:4, c(4, 3, 2, 2)),
  # flags start as FALSE, consistent with the printed output
  flag1 = FALSE, flag2 = FALSE, key = "year"
)
DT
#      name surname maiden_name id year flag1 flag2
#  1:   Joe   Smith              1    1 FALSE FALSE
#  2:   Joe   Smith              1    1 FALSE FALSE
#  3:   Jim   Jones              2    1 FALSE FALSE
#  4: Carol  Clymer              3    1 FALSE FALSE
#  5:   Joe   Smith             NA    2 FALSE FALSE
#  6: Carol   Klein      Clymer NA    2 FALSE FALSE
#  7:   Ann  Cotter             NA    2 FALSE FALSE
#  8:   Ann  Cotter             NA    3 FALSE FALSE
#  9:  Beth   Brown             NA    3 FALSE FALSE
# 10:   Joe   Smith             NA    4 FALSE FALSE
# 11:   Joe   Smith             NA    4 FALSE FALSE
My approach is, for each year, to first try to match on first name/last name from a prior year; if that fails, then try to match on first name/maiden name. I want to define flag1 to denote an exact match and flag2 to denote a marriage.
for (yr in 2:4) {
  # which ids have we hit so far?
  existing_ids = DT[.(yr), unique(id)]
  # find people in prior years appearing to
  #   correspond to those people
  unmatched =
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)
  # merge a la Arun, define flag1
  setkey(DT, name, surname)
  DT[year == yr, c("id", "flag1") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)
  # repeat, this time keying on name/maiden_name
  existing_ids = DT[.(yr), unique(id)]
  unmatched =
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)
  # now define flag2 = TRUE
  setkey(DT, name, maiden_name)
  DT[year == yr & is.na(id), c("id", "flag2") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)
  # this is messy, but I'm trying to increment id
  #   for "new" individuals
  setkey(DT, name, surname, maiden_name)
  DT[year == yr & is.na(id),
     id := unique(
       DT[year == yr & is.na(id)],
       by = c("name", "surname", "maiden_name")
     )[ , count := .I][.SD, count] + DT[ , max(id, na.rm = TRUE)]
  ]
  # re-sort by year at the end
  setkey(DT, year)
}
I was hoping that by including the TRUE value in the j argument while I define id, only the matched names (e.g., Joe at the first step) would have their flag updated to TRUE, but this isn't the case; they are all updated:
DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2  TRUE  TRUE
#  6: Carol   Klein      Clymer  3    2  TRUE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3  TRUE  TRUE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
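The effect can be reproduced on a much smaller join. The sketch below uses two hypothetical toy tables, `lookup` and `queries` (illustrative names, not part of the example above), and suggests why every row ends up flagged: a constant in `j` is recycled over all rows of the join result, including rows of `i` that found no match.

```r
library(data.table)

# Hypothetical toy tables, for illustration only
lookup  <- data.table(name = c("Joe", "Carol"), id = c(1L, 3L))
queries <- data.table(name = c("Joe", "Ann"))

# The join returns one row per row of `queries`; "Ann" has no match,
# so her id comes back as NA. The constant TRUE is recycled to every
# row of the result, matched or not.
res_const <- lookup[queries, .(id, flag = TRUE), on = "name"]
print(res_const)
```

Here the second row has a missing `id` but still carries `flag = TRUE`, which mirrors what happens to the unmatched rows in the loop.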
Is there any way to update only the matched rows' flag values? Ideal output is as follows:
DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2 FALSE FALSE
#  6: Carol   Klein      Clymer  3    2 FALSE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3 FALSE FALSE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
Recommended answer
The key (no pun intended), I think, was to realize that the merge was returning NA for the missed IDs, so I should add the flag to unmatched at each step, e.g., at step 1:
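# carry the flag as a column of `unmatched`, so that rows
#   without a match come back as NA rather than TRUE
unmatched <- DT[.(1:(yr - 1L))
                ][!id %in% existing_ids,
                  .SD[.N], by = id][ , flag1 := TRUE]
# join on name/surname and update id and flag1 by reference
DT[year == yr, c("id", "flag1") :=
     unmatched[.SD, .(id, flag1), on = c("name", "surname")]]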
This ultimately yields:
> DT[ ]
     name surname maiden_name id year flag1 flag2
 1: Carol  Clymer              3    1 FALSE FALSE
 2:   Jim   Jones              2    1 FALSE FALSE
 3:   Joe   Smith              1    1 FALSE FALSE
 4:   Joe   Smith              1    1 FALSE FALSE
 5:   Ann  Cotter              4    2    NA    NA
 6: Carol   Klein      Clymer  3    2    NA  TRUE
 7:   Joe   Smith              1    2  TRUE FALSE
 8:   Ann  Cotter              4    3  TRUE FALSE
 9:  Beth   Brown              5    3    NA    NA
10:   Joe   Smith              1    4  TRUE FALSE
11:   Joe   Smith              1    4  TRUE FALSE
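A minimal sketch of the same idea on hypothetical toy tables (`lookup` and `queries` are illustrative names, not from the code above): when the flag is stored as a column of the lookup table instead of being a constant in `j`, rows without a match receive NA rather than TRUE.

```r
library(data.table)

# Hypothetical toy tables, for illustration only
lookup  <- data.table(name = c("Joe", "Carol"), id = c(1L, 3L))
queries <- data.table(name = c("Joe", "Ann"), id = NA_integer_, flag = FALSE)

# Store the flag in the lookup table so that it travels with the join
lookup[, flag := TRUE]

# Update by reference: inside the inner join, `id` and `flag` refer to
# the columns of `lookup`, so matched rows get (id, TRUE) and
# unmatched rows get (NA, NA)
queries[, c("id", "flag") := lookup[.SD, .(id, flag), on = "name"]]
print(queries)
```

Joe ends up with `id = 1` and `flag = TRUE`, while Ann's `id` and `flag` are both NA, which is the same pattern as rows 5 and 9 in the output above.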
One problem remaining is that some flags that should be FALSE have been reset to NA; it would be nice to be able to set nomatch = FALSE, but I'm not too worried about this side effect, since the key for me is knowing when each flag is TRUE.
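One possible way around the NA side effect, sketched here on hypothetical toy tables rather than taken from the original answer, is to compute the flag from whether the join returned an `id`. This assumes every row of the lookup table has a non-missing `id`; `lookup` and `queries` are illustrative names.

```r
library(data.table)

# Hypothetical toy tables, for illustration only
lookup  <- data.table(name = c("Joe", "Carol"), id = c(1L, 3L))
queries <- data.table(name = c("Joe", "Ann"), id = NA_integer_, flag = FALSE)

# !is.na(id) is TRUE only where the join found a row in `lookup`
# (assumption: `lookup$id` itself never contains NA)
queries[, c("id", "flag") := lookup[.SD, .(id, !is.na(id)), on = "name"]]
print(queries)

# Alternative: keep a flag column in the lookup table and
# reset the NA flags after the update
# queries[is.na(flag), flag := FALSE]
```

With this variant the unmatched row keeps a missing `id` but gets `flag = FALSE` instead of NA.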