Merging two datasets on approximate values


Question


I need to merge (left join) two data sets x and y.

merge(x,y, by.x = "z", by.y = "zP", all.x = TRUE)


Not every value of z is present in zP, but there is always a nearest value in zP. So we need to use the nearest value in zP when merging. For example

z <- c(0.231, 0.045, 0.632, 0.217, 0.092, ...)
zP <- c(0.010,0.013, 0.017, 0.021, ...)

How can we do this in R?

Answer


Based on the information you provided, it sounds like you want to keep all the observations in x, and then for each observation in x you want to find the observation in y that minimizes the distance between columns z and zP. If that is what you are looking for, then something like this might work:

> library(data.table)

> x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092), k = c("A","A","B","B","B"))
> y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421), m = c(1,2,3,4))

> x
       z k 
1: 0.231 A 
2: 0.045 A 
3: 0.632 B 
4: 0.217 B 
5: 0.092 B 
> y
      zP m
1: 0.010 1
2: 0.813 2
3: 0.017 3
4: 0.421 4

> find.min.zP <- function(x){
+   y[which.min(abs(x - zP)), zP]
+ }

> x[, zP := find.min.zP(z), by = z]

> x
       z k    zP
1: 0.231 A 0.421
2: 0.045 A 0.017
3: 0.632 B 0.813
4: 0.217 B 0.017
5: 0.092 B 0.017

> merge(x, y, by="zP", all.x = T, all.y = F)
      zP     z k m
1: 0.017 0.045 A 3
2: 0.017 0.217 B 3
3: 0.017 0.092 B 3
4: 0.421 0.231 A 4
5: 0.813 0.632 B 2


This is the solution that popped into my head given that I use data.table quite a bit. Please note that using data.table here may or may not be the most elegant way, and it may not even be the fastest way (although if x and y are large, some solution involving data.table probably will be the fastest). Also note that this is likely an example of using data.table "badly", as I didn't make any effort to optimize for speed. If speed is important, I would highly recommend reading the helpful documentation on the GitHub wiki. Hope that helps.

Edit:


As I suspected, data.table provides a much better way, which Arun pointed out in the comments.

> setkey(x, z)
> setkey(y, zP)
> y[x, roll="nearest"]

      zP m k
1: 0.045 3 A
2: 0.092 3 B
3: 0.217 3 B
4: 0.231 4 A
5: 0.632 2 B


The only difference is that the z column is now named zP and the original zP column is gone. If preserving that column is important you can always copy the zP column in y to a new column named z and join on that.

> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> y[x, roll='nearest']
      zP m     z k
1: 0.017 3 0.045 A
2: 0.017 3 0.092 B
3: 0.017 3 0.217 B
4: 0.421 4 0.231 A
5: 0.813 2 0.632 B


This is slightly less typing, but the real improvement is in compute times with large datasets.

> x <- data.table(z = runif(100000, 0, 100), k = sample(LETTERS, 100000, replace = T))
> y <- data.table(zP = runif(50000, 0, 100), m = sample(letters, 50000, replace = T))

> start <- proc.time()
> x[, zP := find.min.zP(z), by = z]
> slow <- merge(x, y, by="zP", all.x = T, all.y = F)
> proc.time() - start
  user  system elapsed 
104.849  0.072 106.432 

> x[, zP := NULL] # Drop the zP column we added to x doing the merge the slow way
> start <- proc.time()
> y[, z := zP]
> setkey(x, z)
> setkey(y, z)
> fast <- y[x, roll='nearest']
> proc.time() - start
 user  system elapsed 
0.046   0.000   0.045

# Reorder the rows and columns so that we can compare the two data tables
> setkey(slow, z)
> setcolorder(slow, c("z", "zP", "k", "m"))
> setcolorder(fast, c("z", "zP", "k", "m"))
> all.equal(slow, fast)
TRUE


Notice that the faster method is 2,365 times faster! I would expect the time gains to be even more dramatic for a data set with more than 100,000 observations (which is relatively small these days). This is why reading the data.table documentation is worthwhile if you are working with large data sets. You can often achieve very large speed-ups by using the built-in methods, but you won't know they're there unless you look.
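As a side note (not part of the original answer): in more recent data.table releases (1.9.6 and later, an assumption worth checking against your installed version), the join columns can be given inline with the `on=` argument, which avoids mutating the tables with setkey. A minimal sketch of the same nearest-value rolling join:

```r
# Sketch assuming data.table >= 1.9.6, where joins can name their
# columns inline via on= instead of requiring setkey() first.
library(data.table)

x <- data.table(z = c(0.231, 0.045, 0.632, 0.217, 0.092),
                k = c("A", "A", "B", "B", "B"))
y <- data.table(zP = c(0.010, 0.813, 0.017, 0.421),
                m = c(1, 2, 3, 4))

# Match y's zP to x's z, rolling each z to the nearest zP.
# As before, the result's zP column holds x's original z values.
y[x, on = .(zP = z), roll = "nearest"]
```

This leaves the keys (and row order) of x and y untouched, which can matter if other code depends on them.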

