在R中联接/匹配数据帧 [英] Joining/matching data frames in R

查看:116
本文介绍了在R中联接/匹配数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据帧.

第一列有两列:x是水深,y是每个深度的温度.

第二个也有两列,x也是水深,但是与第一个表中的深度不同.第二列z是盐度.

我想通过将z添加到第一个表中,以x联接两个表.我已经学习了如何在tidyr中使用'key'联接表,但这仅在键相同的情况下有效.这两个表中的x不同.

我想做的是将表2中的深度x匹配到表1中深度的10%以内(即将表2 x中的1.1匹配到表1 x中的1.0). /p>

我该怎么做?

Table 1
| x | y  |
|---|----|
| 1 | 25 |
| 2 | 26 |
| 3 | 27 |

Table 2
| x    | z  |
|------|----|
| 1.1  | 30 |
| 2.05 | 35 |
| 3.8  | 34 |

我想要

Table 1
| x | y  | z  |
|---|----|----|
| 1 | 25 | 30 |
| 2 | 26 | 35 |
| 3 | 27 | NA |

前两个'x'值在Table 2中具有匹配项(绝对差在10%以内).第三个没有.

解决方案

在提供数据后进行

从@MKR的帖子中获取数据定义:

library(fuzzyjoin)
fuzzy_left_join(Table_1, Table_2,match_fun = function(x,y)  y> x & y<=1.1*x )
# Joining by: "x"
#   x.x  y  x.y  z
# 1   1 25 1.10 30
# 2   2 26 2.05 35
# 3   3 27   NA NA


带有假数据的一般性解释(第一个答案)

假数据

iris1 <- head(iris[1:2])
iris1
#   Sepal.Length Sepal.Width
# 1          5.1         3.5
# 2          4.9         3.0
# 3          4.7         3.2
# 4          4.6         3.1
# 5          5.0         3.6
# 6          5.4         3.9

iris2 <- head(iris[c(1,3)])
set.seed(1)

# add noise
iris2$Sepal.Length <- iris2$Sepal.Length + rnorm(6,sd=0.05)

# shuffle rows
iris2 <- iris2[sample(seq(nrow(iris2))),]

iris2
#   Sepal.Length Petal.Length
# 5     5.016475          1.4
# 2     4.909182          1.4
# 4     4.679764          1.5
# 6     5.358977          1.7
# 3     4.658219          1.3
# 1     5.068677          1.4

代码

library(fuzzyjoin)
fuzzy_left_join(iris1,iris2,match_fun= function(x,y) y>0.99*x & y<1.01*x )
# Joining by: "Sepal.Length"
# Sepal.Length.x Sepal.Width Sepal.Length.y Petal.Length
# 1            5.1         3.5       5.068677          1.4
# 2            4.9         3.0       4.909182          1.4
# 3            4.7         3.2       4.679764          1.5
# 4            4.7         3.2       4.658219          1.3
# 5            4.6         3.1             NA           NA
# 6            5.0         3.6       5.016475          1.4
# 7            5.4         3.9       5.358977          1.7

我们看到一些行非常匹配,让我们看一下异常.第4行在iris2中添加了太多噪音,因此与第2行匹配的第3行配对.当我选择左连接时,仍显示第4行,但iris2的列带有NAs.

据我了解:

  • 连接列将展开
  • 该函数将这些长列(此处为6*6==36个元素)作为参数
  • 我们应用矢量化函数(在本例中为<&)来返回逻辑矢量,该逻辑将过滤这些长列以构建输出data.frame.

distance_left_join更易于使用,但它是绝对距离,而不是相对距离.

I have two data frames.

The first one has two columns: x is water depth, y is temperature at each depth.

The second one has two columns too, x is also water depth, but at different depth compared to that in the first table. The second column z is salinity.

I want to join the two tables by x, by adding z to the first table. I have learned how to join tables using 'key' in tidyr , but that only works if the keys are identical. The x in these two tables are not the same.

What I want to do is to match the depth x in table 2 to that within 10% of that in table 1 (i.e. match 1.1 in table 2 x to 1.0 in table 1 x).

How can I do this?

Table 1
| x | y  |
|---|----|
| 1 | 25 |
| 2 | 26 |
| 3 | 27 |

Table 2
| x    | z  |
|------|----|
| 1.1  | 30 |
| 2.05 | 35 |
| 3.8  | 34 |

I want

Table 1
| x | y  | z  |
|---|----|----|
| 1 | 25 | 30 |
| 2 | 26 | 35 |
| 3 | 27 | NA |

The first two 'x' values have matches (within 10% absolute difference) in Table 2. The third one doesn't.

解决方案

Edit after data was provided:

Taking data definition from @MKR's post:

library(fuzzyjoin)
fuzzy_left_join(Table_1, Table_2,match_fun = function(x,y)  y> x & y<=1.1*x )
# Joining by: "x"
#   x.x  y  x.y  z
# 1   1 25 1.10 30
# 2   2 26 2.05 35
# 3   3 27   NA NA


general explanations with fake data (first answer)

fake data

iris1 <- head(iris[1:2])
iris1
#   Sepal.Length Sepal.Width
# 1          5.1         3.5
# 2          4.9         3.0
# 3          4.7         3.2
# 4          4.6         3.1
# 5          5.0         3.6
# 6          5.4         3.9

iris2 <- head(iris[c(1,3)])
set.seed(1)

# add noise
iris2$Sepal.Length <- iris2$Sepal.Length + rnorm(6,sd=0.05)

# shuffle rows
iris2 <- iris2[sample(seq(nrow(iris2))),]

iris2
#   Sepal.Length Petal.Length
# 5     5.016475          1.4
# 2     4.909182          1.4
# 4     4.679764          1.5
# 6     5.358977          1.7
# 3     4.658219          1.3
# 1     5.068677          1.4

code

library(fuzzyjoin)
fuzzy_left_join(iris1,iris2,match_fun= function(x,y) y>0.99*x & y<1.01*x )
# Joining by: "Sepal.Length"
# Sepal.Length.x Sepal.Width Sepal.Length.y Petal.Length
# 1            5.1         3.5       5.068677          1.4
# 2            4.9         3.0       4.909182          1.4
# 3            4.7         3.2       4.679764          1.5
# 4            4.7         3.2       4.658219          1.3
# 5            4.6         3.1             NA           NA
# 6            5.0         3.6       5.016475          1.4
# 7            5.4         3.9       5.358977          1.7

We see that some rows matched really well, let's take a look at the exceptions. The row number 4 had too much noise added in iris2, so it got paired with row 3, which has 2 matches. As I chose a left join, row 4 is still shown, but with NAs for iris2's columns.

As I understand it:

  • The joining columns will be expanded
  • The function takes these long columns (6*6==36 elements here) as arguments
  • We an apply vectorized functions (such as < or & in this case) to return a vector of logical that will filter these long columns in order to build the output data.frame.

distance_left_join is more straightforward to use, but then it's absolute distance, not relative.

这篇关于在R中联接/匹配数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆