Select rows in one data frame that partially match rows in another data frame

Question

I wish to select rows in one data frame, data.1, that partially match rows in a second data frame, keep.these, to obtain the desired.result. I have found several questions here that match based on one column, but I want to match on three columns: STATE, COUNTY and CITY. I have come up with three solutions so far, but none seem ideal.

Note that each row contains a unique combination of STATE, COUNTY and CITY in my real data.

When I use merge I must re-order. The function match seems to work, but I am not familiar with it and do not know if my use of this function is as intended. The apply solution below is clearly too complex.

The merge approach would be ideal if I did not have to reorder the result. Reordering can be time consuming with large data sets. The match approach seems okay if someone can confirm this is a reasonable approach.

Is there a better solution, ideally in base R?

data.1 <- read.table(text= "
     CITY     COUNTY   STATE        AA
       1          1         1        2
       2          1         1        4
       1          2         1        6
       2          2         1        8
       1          1         2       20
       2          1         2       40
       1          2         2       60
       2          2         2       80
       1          1         3      200
       2          1         3      400
       1          2         3      600
       2          2         3      800
       1          1         4     2000
       2          1         4     4000
       1          2         4     6000
       2          2         4     8000
       1          1         5    20000
       2          1         5    40000
       1          2         5    60000
       2          2         5    80000
", header=TRUE, na.strings=NA)

keep.these <- read.table(text= "
     CITY     COUNTY     STATE      BB
       1          1         2      -10
       2          1         2      -11
       1          2         2      -12
       2          2         2      -13
       1          1         4      -14
       2          1         4      -15
       1          2         4      -16
       2          2         4      -17
", header=TRUE, na.strings=NA)

desired.result <- read.table(text= "
     CITY     COUNTY    STATE       AA
       1          1         2       20
       2          1         2       40
       1          2         2       60
       2          2         2       80
       1          1         4     2000
       2          1         4     4000
       1          2         4     6000
       2          2         4     8000
", header=TRUE, na.strings=NA)

##########

# this works, but I need to reorder

new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))

new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]

rownames(desired.result) <- NULL
rownames(new.data.a)     <- NULL

all.equal(desired.result, new.data.a)

##########

# this seems to work, but match is unfamiliar

new.data.2 <- data.1[match(data.1$CITY  , keep.these$CITY  , nomatch=0) & 
                     match(data.1$STATE , keep.these$STATE , nomatch=0) & 
                     match(data.1$COUNTY, keep.these$COUNTY, nomatch=0),]

rownames(desired.result) <- NULL
rownames(new.data.2)     <- NULL

all.equal(desired.result, new.data.2)

##########

# this works, but is too complex

data.1b      <- data.frame(my.group = apply(    data.1[,1:3], 1, paste, collapse = "."),     data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)

data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]

rownames(desired.result) <- NULL
rownames(data.1b)        <- NULL

all.equal(desired.result, data.1b)

##########

Recommended answer

Here is a generic solution for this type of problem which is very efficient:

data.1.ID <- paste(data.1[,1],data.1[,2],data.1[,3])

keep.these.ID <- paste(keep.these[,1],keep.these[,2],keep.these[,3])

desired.result <- data.1[data.1.ID %in% keep.these.ID,]

I have simply created a unique ID for each record and then searched for it. Note: this keeps the row names from data.1, so they will no longer run from 1 to n. You may want to add the following:

row.names(desired.result) <- 1:nrow(desired.result)
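
The question also asks whether the column-by-column match approach is sound. A minimal sketch with made-up data suggests that it is not in general (the data frames `d` and `k` and the helper `row_key` are illustrative names, not part of the original post). Testing each column separately accepts any row whose values each appear somewhere in keep.these, even when that exact combination does not. It happens to work on the example data because every CITY/COUNTY pair occurs in each kept STATE. A composite key compares whole combinations instead; the separator is assumed to be a character that never occurs in the data.

```r
# Made-up data: only two (CITY, COUNTY, STATE) combinations should be kept.
d <- data.frame(CITY   = c(1, 2, 1, 2),
                COUNTY = c(1, 2, 1, 2),
                STATE  = c(2, 2, 4, 4),
                AA     = c(10, 20, 30, 40))
k <- data.frame(CITY = c(1, 2), COUNTY = c(1, 2), STATE = c(2, 4))

# Column-by-column test: each value only has to appear somewhere in k.
per.column <- d$CITY %in% k$CITY & d$COUNTY %in% k$COUNTY & d$STATE %in% k$STATE

# Hypothetical helper: one key per row, built from the chosen columns.
row_key <- function(df, cols, sep = "\r") do.call(paste, c(df[cols], sep = sep))

cols   <- c("CITY", "COUNTY", "STATE")
by.row <- row_key(d, cols) %in% row_key(k, cols)

per.column   # TRUE TRUE TRUE TRUE: rows 2 and 3 are false positives
by.row       # TRUE FALSE FALSE TRUE
d[by.row, ]  # matching rows, still in their original order
```

Because `%in%` returns a logical vector aligned with the rows of `d`, the subset keeps the original row order and no re-sorting is needed.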

Here is another way to solve the same problem.

If you have a very large data set, say millions of rows, another very efficient solution is using the package data.table. It works nearly 50-100 times faster than merge, depending on how much data you have.

All you have to do is the following:

library(data.table)

Step 1: Convert each data.frame to a data.table, with the first three columns as keys.

d1 <- data.table(data.1, key=names(data.1)[1:3])
kt <- data.table(keep.these, key=names(keep.these)[1:3])

Step 2: Merge using data.table's binary search:

d1[kt]

Note 1: Notice how simple this is to execute. Note 2: This will sort the data by key. To avoid that, try the following:

data.1$index <- 1:nrow(data.1)  # Add index to original data
d1 <- data.table(data.1,key=names(data.1)[1:3]) # Step1 as above
kt <- data.table(keep.these,key=names(keep.these)[1:3])  # Step1 as above
d1[kt][order(index)]  # Step2 as above

If you want to remove the last two columns (index, BB), that's straightforward too:

d1[kt][order(index)][, -(5:6), with=FALSE]  # Remove index and BB
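
As a possible alternative to the temporary index column, the sketch below asks data.table for the matching row numbers and then subsets in sorted order. It assumes a data.table version that supports the `on=` argument (1.9.6 or later), and it uses small made-up stand-ins for data.1 and keep.these, so treat it as an illustration rather than a drop-in replacement.

```r
library(data.table)

# Small stand-ins for data.1 and keep.these from the question.
d1 <- data.table(CITY = c(1, 2, 1, 2), COUNTY = c(1, 1, 1, 1),
                 STATE = c(1, 1, 2, 2), AA = c(2, 4, 20, 40))
kt <- data.table(CITY = c(2, 1), COUNTY = c(1, 1), STATE = c(2, 2),
                 BB = c(-11, -10))

cols <- c("CITY", "COUNTY", "STATE")

# which = TRUE returns the row numbers of d1 matched by each row of kt
# (NA where a kt row has no match in d1). No keys are set, so d1 is not sorted.
idx <- d1[kt, on = cols, which = TRUE]

# sort() drops NAs and restores the original row order of d1.
result <- d1[sort(idx)]
result
```

Since only rows of `d1` are selected, the BB column never enters the result and nothing has to be dropped afterwards.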

Try this with large data sets, and compare the timing with merge. It's typically about 50-100 times faster.
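
One way to run that comparison is sketched below on synthetic data. The sizes and the names `big` and `keep` are made up for illustration, and the timings will vary by machine and package version, so no particular speed-up is implied. The data.table part runs only if the package is installed.

```r
set.seed(1)
n <- 1e5  # made-up size; increase it to see larger differences

# Synthetic data with unique (CITY, COUNTY, STATE) combinations.
big <- expand.grid(CITY = 1:100, COUNTY = 1:100, STATE = 1:(n / 1e4))
big$AA <- seq_len(nrow(big))
cols <- c("CITY", "COUNTY", "STATE")
keep <- big[sample(nrow(big), n / 10), cols]

# Base R: merge
t.merge <- system.time(res.merge <- merge(keep, big, by = cols))

# Base R: composite key and %in%
t.paste <- system.time({
  key.big   <- do.call(paste, c(big[cols],  sep = "\r"))
  key.keep  <- do.call(paste, c(keep[cols], sep = "\r"))
  res.paste <- big[key.big %in% key.keep, ]
})

print(t.merge)
print(t.paste)

# data.table keyed join, timed only when the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  t.dt <- system.time({
    d1 <- data.table::data.table(big,  key = cols)
    kt <- data.table::data.table(keep, key = cols)
    res.dt <- d1[kt]
  })
  print(t.dt)
}
```

The data.table timing here includes the cost of building the keys; timing `d1[kt]` alone would measure only the join.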

To learn more about data.table, try:

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

Or see it in action:

example(data.table)

Hope this helps!
