Select rows in one data frame that partially match rows in another data frame

Question

I wish to select rows in one data frame, data.1, that partially match rows in a second data frame, keep.these, to obtain the desired.result. I have found several questions here that match based on one column, but I want to match on three columns: STATE, COUNTY and CITY. I have come up with three solutions so far, but none seem ideal.

Note that each row contains a unique combination of STATE, COUNTY and CITY in my real data.

When I use merge I must re-order. The function match seems to work, but I am not familiar with it and do not know if my use of this function is as intended. The apply solution below is clearly too complex.

The merge approach would be ideal if I did not have to reorder the result. Reordering can be time consuming with large data sets. The match approach seems okay if someone can confirm this is a reasonable approach.

Is there a better solution, ideally in base R?

data.1 <- read.table(text= "
     CITY     COUNTY   STATE        AA
       1          1         1        2
       2          1         1        4
       1          2         1        6
       2          2         1        8
       1          1         2       20
       2          1         2       40
       1          2         2       60
       2          2         2       80
       1          1         3      200
       2          1         3      400
       1          2         3      600
       2          2         3      800
       1          1         4     2000
       2          1         4     4000
       1          2         4     6000
       2          2         4     8000
       1          1         5    20000
       2          1         5    40000
       1          2         5    60000
       2          2         5    80000
", header=TRUE, na.strings=NA)

keep.these <- read.table(text= "
     CITY     COUNTY     STATE      BB
       1          1         2      -10
       2          1         2      -11
       1          2         2      -12
       2          2         2      -13
       1          1         4      -14
       2          1         4      -15
       1          2         4      -16
       2          2         4      -17
", header=TRUE, na.strings=NA)

desired.result <- read.table(text= "
     CITY     COUNTY    STATE       AA
       1          1         2       20
       2          1         2       40
       1          2         2       60
       2          2         2       80
       1          1         4     2000
       2          1         4     4000
       1          2         4     6000
       2          2         4     8000
", header=TRUE, na.strings=NA)

##########

# this works, but I need to reorder

new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))

new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]

rownames(desired.result) <- NULL
rownames(new.data.a)     <- NULL

all.equal(desired.result, new.data.a)

##########

# this seems to work, but match is unfamiliar

new.data.2 <- data.1[match(data.1$CITY  , keep.these$CITY  , nomatch=0) & 
                     match(data.1$STATE , keep.these$STATE , nomatch=0) & 
                     match(data.1$COUNTY, keep.these$COUNTY, nomatch=0),]

rownames(desired.result) <- NULL
rownames(new.data.2)     <- NULL

all.equal(desired.result, new.data.2)

##########

# this works, but is too complex

data.1b      <- data.frame(my.group = apply(    data.1[,1:3], 1, paste, collapse = "."),     data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)

data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]

rownames(desired.result) <- NULL
rownames(data.1b)        <- NULL

all.equal(desired.result, data.1b)

##########
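
One caution about the match attempt above: it tests each column separately, so a row passes whenever each of its values occurs somewhere in the corresponding column of keep.these, even if that exact combination of CITY, COUNTY and STATE never occurs. It appears to give the right answer for the example data only because keep.these contains every CITY/COUNTY pair for states 2 and 4. A minimal sketch with made-up data (toy.data and toy.keep are hypothetical names) where the two approaches disagree:

```r
# Hypothetical toy data: only (CITY=1, COUNTY=1, STATE=2) and
# (CITY=2, COUNTY=2, STATE=4) should be kept.
toy.data <- data.frame(CITY   = c(1, 2, 1, 2),
                       COUNTY = c(1, 1, 2, 2),
                       STATE  = c(2, 2, 4, 4),
                       AA     = c(20, 40, 6000, 8000))
toy.keep <- data.frame(CITY   = c(1, 2),
                       COUNTY = c(1, 2),
                       STATE  = c(2, 4))

# Column-by-column match(): each test only asks whether the value occurs
# anywhere in that column of toy.keep, so every row passes.
by.column <- toy.data[match(toy.data$CITY,   toy.keep$CITY,   nomatch = 0) &
                      match(toy.data$STATE,  toy.keep$STATE,  nomatch = 0) &
                      match(toy.data$COUNTY, toy.keep$COUNTY, nomatch = 0), ]

# merge() compares the three columns jointly, so only true matches remain.
by.merge <- merge(toy.keep, toy.data, by = c("CITY", "COUNTY", "STATE"))

nrow(by.column)  # all 4 rows pass, two of them unwanted
nrow(by.merge)   # 2 rows
```

The per-column version is therefore only safe when the kept combinations form a full cross of the kept values, which is rarely guaranteed in real data.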

Recommended answer

Here is a generic solution for this type of problem which is very efficient:

data.1.ID <- paste(data.1[,1],data.1[,2],data.1[,3])

keep.these.ID <- paste(keep.these[,1],keep.these[,2],keep.these[,3])

desired.result <- data.1[data.1.ID %in% keep.these.ID,]

I have simply created a unique ID for each record and then searched for it. Note: the result keeps the original row names from data.1 (here 5-8 and 13-16), so you may want to reset them:

row.names(desired.result) <- 1:nrow(desired.result)
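
The same idea can be sketched with the key columns selected by name rather than by position, which also works when the two data frames store the columns in a different order. The helper make.key and the "\r" separator are assumptions made for illustration, not part of the original answer; the separator only needs to be a string that never occurs in the key values.

```r
# Hypothetical helper: build one key string per row from named columns.
make.key <- function(df, cols) {
  # "\r" is an assumed separator that should not occur in the data; it
  # avoids collisions such as paste("a b", "c") == paste("a", "b c").
  do.call(paste, c(unname(as.list(df[cols])), sep = "\r"))
}

# Small stand-ins for data.1 and keep.these (made-up values).
x <- data.frame(CITY   = c(1, 2, 1, 2),
                COUNTY = c(1, 1, 2, 2),
                STATE  = c(1, 1, 2, 2),
                AA     = c(2, 4, 60, 80))
# Note the different column order here.
keep <- data.frame(STATE  = c(2, 1),
                   COUNTY = c(2, 1),
                   CITY   = c(1, 2),
                   BB     = c(-12, -11))

key.cols <- c("CITY", "COUNTY", "STATE")

# Keep the rows of x whose key also appears in keep; x's order is preserved.
result <- x[make.key(x, key.cols) %in% make.key(keep, key.cols), ]
rownames(result) <- NULL
result
```

Because the rows are selected with a logical index into x, no re-sorting is needed afterwards.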

Here is another way to solve the same problem.

If you have a very large data set, say millions of rows, another very efficient solution is using the package data.table. It works nearly 50-100 times faster than merge, depending on how much data you have.

All you have to do is:

library(data.table)

Step1: Convert data.frame to data.table, with first three columns as keys.

d1 <- data.table(data.1, key=names(data.1)[1:3])
kt <- data.table(keep.these, key=names(keep.these)[1:3])

Step2: A merge using data.table's binary search:

d1[kt]

Note 1: The call itself is very simple. Note 2: The result is sorted by key. To get back the original row order, try the following:

data.1$index <- 1:nrow(data.1)  # Add index to original data
d1 <- data.table(data.1,key=names(data.1)[1:3]) # Step1 as above
kt <- data.table(keep.these,key=names(keep.these)[1:3])  # Step1 as above
d1[kt][order(index)]  # Step2 as above

If you want to remove the last two columns (index, BB), that's straightforward too:

d1[kt][order(index)][, -(5:6), with=FALSE]  # Remove index and BB
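
As a possible alternative to the helper index column, the join can be asked for matching row numbers instead of joined rows. The sketch below assumes a data.table version that supports the on= argument, and uses small made-up tables in place of data.1 and keep.these.

```r
library(data.table)

# Small stand-ins for data.1 and keep.these (made-up values, no keys set).
d1 <- data.table(CITY   = c(1, 2, 1, 2),
                 COUNTY = c(1, 1, 2, 2),
                 STATE  = c(2, 2, 1, 1),
                 AA     = c(20, 40, 6, 8))
kt <- data.table(CITY   = c(2, 1),
                 COUNTY = c(2, 1),
                 STATE  = c(1, 2),
                 BB     = c(-10, -11))

key.cols <- c("CITY", "COUNTY", "STATE")

# which = TRUE returns the row numbers of d1 that match rows of kt;
# nomatch = 0L drops rows of kt that have no partner in d1.
rows <- d1[kt, on = key.cols, which = TRUE, nomatch = 0L]

# Sorting the row numbers restores d1's original order, and because we
# subset d1 itself, the BB column never appears in the result.
result <- d1[sort(rows)]
result
```

This avoids both the extra index column and the column clean-up step, at the cost of not using a pre-set key.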

Try this with large data sets, and compare the timing with merge. It's typically about 50-100 times faster.
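
A rough way to run that comparison yourself is sketched below. The synthetic data, its size, and the names big and keep.big are assumptions for illustration; the timings depend heavily on the machine, the data, and the package version, so no particular speed-up is implied.

```r
library(data.table)

set.seed(1)
key.cols <- c("CITY", "COUNTY", "STATE")

# Synthetic data: 200,000 rows, each a unique (CITY, COUNTY, STATE) triple.
big <- expand.grid(CITY = 1:100, COUNTY = 1:100, STATE = 1:20)
big$AA <- runif(nrow(big))

# Keep a random quarter of the combinations (sampled without replacement).
keep.big <- big[sample(nrow(big), 5e4), key.cols]

# Base R merge().
t.merge <- system.time(
  res.merge <- merge(keep.big, big, by = key.cols)
)

# Pasted ID plus %in%, as in the first solution.
t.paste <- system.time({
  id.big   <- paste(big$CITY, big$COUNTY, big$STATE)
  id.keep  <- paste(keep.big$CITY, keep.big$COUNTY, keep.big$STATE)
  res.paste <- big[id.big %in% id.keep, ]
})

# Keyed data.table join; key creation is timed separately from the join.
t.key <- system.time({
  d1 <- data.table(big, key = key.cols)
  kt <- data.table(keep.big, key = key.cols)
})
t.join <- system.time(res.dt <- d1[kt])

# Elapsed seconds for each step.
c(merge  = t.merge[["elapsed"]],
  paste  = t.paste[["elapsed"]],
  setkey = t.key[["elapsed"]],
  join   = t.join[["elapsed"]])
```

Timing key creation separately matters because the sort that builds the key is part of the real cost if the tables are only joined once.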

To learn more about data.table, try:

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

Or see it in action:

example(data.table)

Hope this helps!
