Select rows in one data frame that partially match rows in another data frame
Question
I wish to select rows in one data frame, data.1, that partially match rows in a second data frame, keep.these, to obtain desired.result. I have found several questions here that match on one column, but I want to match on three columns: STATE, COUNTY and CITY. I have come up with three solutions so far, but none seems ideal.
Note that in my real data each row contains a unique combination of STATE, COUNTY and CITY.
When I use merge, I must reorder the result. The function match seems to work, but I am not familiar with it and do not know whether I am using it as intended. The apply solution below is clearly too complex.
The merge approach would be ideal if I did not have to reorder the result, since reordering can be time-consuming with large data sets. The match approach seems fine if someone can confirm that it is reasonable.
Is there a better solution, ideally in base R?
data.1 <- read.table(text= "
CITY COUNTY STATE AA
1 1 1 2
2 1 1 4
1 2 1 6
2 2 1 8
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 3 200
2 1 3 400
1 2 3 600
2 2 3 800
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
1 1 5 20000
2 1 5 40000
1 2 5 60000
2 2 5 80000
", header=TRUE, na.strings=NA)
keep.these <- read.table(text= "
CITY COUNTY STATE BB
1 1 2 -10
2 1 2 -11
1 2 2 -12
2 2 2 -13
1 1 4 -14
2 1 4 -15
1 2 4 -16
2 2 4 -17
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
CITY COUNTY STATE AA
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
", header=TRUE, na.strings=NA)
##########
# this works, but I need to reorder
new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))
new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]
rownames(desired.result) <- NULL
rownames(new.data.a) <- NULL
all.equal(desired.result, new.data.a)
##########
# this seems to work, but match is unfamiliar
new.data.2 <- data.1[match(data.1$CITY , keep.these$CITY , nomatch=0) &
match(data.1$STATE , keep.these$STATE , nomatch=0) &
match(data.1$COUNTY, keep.these$COUNTY, nomatch=0),]
rownames(desired.result) <- NULL
rownames(new.data.2) <- NULL
all.equal(desired.result, new.data.2)
##########
# this works, but is too complex
data.1b <- data.frame(my.group = apply( data.1[,1:3], 1, paste, collapse = "."), data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)
data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]
rownames(desired.result) <- NULL
rownames(data.1b) <- NULL
all.equal(desired.result, data.1b)
##########
Recommended answer
Here is a generic and very efficient solution for this type of problem:
data.1.ID <- paste(data.1[,1],data.1[,2],data.1[,3])
keep.these.ID <- paste(keep.these[,1],keep.these[,2],keep.these[,3])
desired.result <- data.1[data.1.ID %in% keep.these.ID,]
I have simply created a unique ID for each record and then searched for it. Note: the selected rows keep their original row names from data.1, so the row names are no longer consecutive. To renumber them, you may want to add the following:
row.names(desired.result) <- 1:nrow(desired.result)
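As a minimal sketch of the same key idea, the helper below pastes the key columns by name rather than by position and uses an explicit separator. The name make_key and the separator "\r" are assumptions made for illustration (the separator is simply chosen as something unlikely to occur in the data), and the small data frames are made up rather than taken from the question. The sketch also compares the row-wise key test with a test that checks each column separately, which can keep rows whose combination never appears in the lookup table.

```r
# Hypothetical helper: build one key string per row from the named columns.
make_key <- function(df, cols, sep = "\r") {
  do.call(paste, c(unname(as.list(df[cols])), sep = sep))
}

key.cols <- c("STATE", "COUNTY", "CITY")

# Small illustrative data (not the data from the question)
big <- data.frame(CITY   = c(1, 2, 1, 2),
                  COUNTY = c(1, 1, 2, 2),
                  STATE  = c(2, 2, 2, 4),
                  AA     = c(20, 40, 60, 8000))
keep <- data.frame(CITY = c(1, 2), COUNTY = c(1, 2), STATE = c(2, 4))

# Row-wise test: keeps only rows whose full combination is in `keep`,
# in the original order of `big`
by.key <- big[make_key(big, key.cols) %in% make_key(keep, key.cols), ]

# Column-wise test: each column is checked independently of the others
by.col <- big[big$CITY   %in% keep$CITY &
              big$COUNTY %in% keep$COUNTY &
              big$STATE  %in% keep$STATE, ]

by.key$AA   # AA values of the rows kept by the row-wise test
by.col$AA   # AA values of the rows kept by the column-wise test
```

In this made-up example the column-wise test keeps every row of big, while the key test keeps only the two rows whose combination is listed in keep. When each combination is unique, as in the question, the key test is the one that matches whole rows.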
Here is another way to solve the same problem.
If you have a very large data set, say millions of rows, another very efficient solution is to use the data.table package. It can run roughly 50-100 times faster than merge, depending on how much data you have.
All you have to do is:
library(data.table)
Step 1: Convert each data.frame to a data.table, with the first three columns as keys.
d1 <- data.table(data.1, key=names(data.1)[1:3])
kt <- data.table(keep.these, key=names(keep.these)[1:3])
Step 2: Merge using data.table's binary search:
d1[kt]
Note 1: Notice how simple this is. Note 2: This sorts the data by key. To avoid that, try the following:
data.1$index <- 1:nrow(data.1) # Add index to original data
d1 <- data.table(data.1,key=names(data.1)[1:3]) # Step1 as above
kt <- data.table(keep.these,key=names(keep.these)[1:3]) # Step1 as above
d1[kt][order(index)] # Step2 as above
If you want to remove the last two columns (index and BB), that is straightforward too:
d1[kt][order(index)][, -(5:6), with=FALSE] # Remove columns 5 and 6 (index, BB)
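Dropping columns by position breaks if the column layout changes, so a variant that drops them by name may be more robust. The sketch below is one way to do that, assuming the data.table package is installed; the small data frames are made up for illustration and are not the data from the question.

```r
library(data.table)

# Small illustrative data (not the full data from the question)
data.1 <- data.frame(CITY = c(1, 2, 1, 2), COUNTY = c(1, 1, 1, 1),
                     STATE = c(1, 1, 2, 2), AA = c(2, 4, 20, 40))
keep.these <- data.frame(CITY = c(2, 1), COUNTY = c(1, 1),
                         STATE = c(2, 2), BB = c(-11, -10))

data.1$index <- seq_len(nrow(data.1))          # remember the original order
d1 <- data.table(data.1,     key = c("CITY", "COUNTY", "STATE"))
kt <- data.table(keep.these, key = c("CITY", "COUNTY", "STATE"))

res <- d1[kt][order(index)]                    # join, then restore the order
res[, c("index", "BB") := NULL]                # drop helper columns by name
res
```

Here := NULL removes the two columns from res by reference, so no copy of the remaining columns is made.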
Try this with large data sets and compare the timing with merge. It is typically about 50-100 times faster.
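A rough way to run such a comparison is sketched below. The synthetic data, the size n and the object names are assumptions made for illustration, and the timings you see will depend on your machine, your R and data.table versions, and the size of the data, so no particular speed-up is implied here.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size; increase it to make the difference more visible

# Synthetic data with unique CITY/COUNTY/STATE combinations
big <- unique(data.frame(CITY   = sample(1000, n, replace = TRUE),
                         COUNTY = sample(100,  n, replace = TRUE),
                         STATE  = sample(50,   n, replace = TRUE)))
big$AA <- seq_len(nrow(big))

# Lookup table: a random tenth of the combinations
keep <- big[sample(nrow(big), nrow(big) %/% 10), 1:3]
keep$BB <- -seq_len(nrow(keep))

key.cols <- c("CITY", "COUNTY", "STATE")

# Base R merge
t.merge <- system.time(
  res.merge <- merge(keep[, key.cols], big, by = key.cols)
)

# Keyed data.table join (conversion and keying are included in the timing)
t.dt <- system.time({
  d1 <- data.table(big,  key = key.cols)
  kt <- data.table(keep, key = key.cols)
  res.dt <- d1[kt]
})

t.merge["elapsed"]
t.dt["elapsed"]
nrow(res.merge) == nrow(res.dt)  # both approaches select the same number of rows
```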
To learn more about data.table, try:
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")
Or see it in action:
example(data.table)
Hope this helps!