在R中联接/匹配数据帧 [英] Joining/matching data frames in R
问题描述
我有两个数据帧.
第一列有两列:x
是水深,y
是每个深度的温度.
第二个也有两列,x
也是水深,但是与第一个表中的深度不同.第二列z
是盐度.
我想通过将z
添加到第一个表中,以x
联接两个表.我已经学习了如何在tidyr
中使用'key'联接表,但这仅在键相同的情况下有效.这两个表中的x
不同.
我想做的是将表2中的深度x
匹配到表1中深度的10%以内(即将表2 x
中的1.1匹配到表1 x
中的1.0). /p>
我该怎么做?
Table 1
| x | y |
|---|----|
| 1 | 25 |
| 2 | 26 |
| 3 | 27 |
Table 2
| x | z |
|------|----|
| 1.1 | 30 |
| 2.05 | 35 |
| 3.8 | 34 |
我想要
Table 1
| x | y | z |
|---|----|----|
| 1 | 25 | 30 |
| 2 | 26 | 35 |
| 3 | 27 | NA |
前两个'x'值在Table 2
中具有匹配项(绝对差在10%以内).第三个没有.
在提供数据后进行
从@MKR的帖子中获取数据定义:
library(fuzzyjoin)
fuzzy_left_join(Table_1, Table_2,match_fun = function(x,y) y> x & y<=1.1*x )
# Joining by: "x"
# x.x y x.y z
# 1 1 25 1.10 30
# 2 2 26 2.05 35
# 3 3 27 NA NA
带有假数据的一般性解释(第一个答案)
假数据
iris1 <- head(iris[1:2])
iris1
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
# 6 5.4 3.9
iris2 <- head(iris[c(1,3)])
set.seed(1)
# add noise
iris2$Sepal.Length <- iris2$Sepal.Length + rnorm(6,sd=0.05)
# shuffle rows
iris2 <- iris2[sample(seq(nrow(iris2))),]
iris2
# Sepal.Length Petal.Length
# 5 5.016475 1.4
# 2 4.909182 1.4
# 4 4.679764 1.5
# 6 5.358977 1.7
# 3 4.658219 1.3
# 1 5.068677 1.4
代码
library(fuzzyjoin)
fuzzy_left_join(iris1,iris2,match_fun= function(x,y) y>0.99*x & y<1.01*x )
# Joining by: "Sepal.Length"
# Sepal.Length.x Sepal.Width Sepal.Length.y Petal.Length
# 1 5.1 3.5 5.068677 1.4
# 2 4.9 3.0 4.909182 1.4
# 3 4.7 3.2 4.679764 1.5
# 4 4.7 3.2 4.658219 1.3
# 5 4.6 3.1 NA NA
# 6 5.0 3.6 5.016475 1.4
# 7 5.4 3.9 5.358977 1.7
我们看到一些行非常匹配,让我们看一下异常.第4行在iris2
中添加了太多噪音,因此与第2行匹配的第3行配对.当我选择左连接时,仍显示第4行,但iris2
的列带有NAs
.
据我了解:
- 连接列将展开
- 该函数将这些长列(此处为
6*6==36
个元素)作为参数 - 我们应用矢量化函数(在本例中为
<
或&
)来返回逻辑矢量,该逻辑将过滤这些长列以构建输出data.frame
.
distance_left_join
更易于使用,但它是绝对距离,而不是相对距离.
I have two data frames.
The first one has two columns: x
is water depth, y
is temperature at each depth.
The second one has two columns too, x
is also water depth, but at different depth compared to that in the first table. The second column z
is salinity.
I want to join the two tables by x
, by adding z
to the first table. I have learned how to join tables using 'key' in tidyr
, but that only works if the keys are identical. The x
in these two tables are not the same.
What I want to do is to match the depth x
in table 2 to that within 10% of that in table 1 (i.e. match 1.1 in table 2 x
to 1.0 in table 1 x
).
How can I do this?
Table 1
| x | y |
|---|----|
| 1 | 25 |
| 2 | 26 |
| 3 | 27 |
Table 2
| x | z |
|------|----|
| 1.1 | 30 |
| 2.05 | 35 |
| 3.8 | 34 |
I want
Table 1
| x | y | z |
|---|----|----|
| 1 | 25 | 30 |
| 2 | 26 | 35 |
| 3 | 27 | NA |
The first two 'x' values have matches (within 10% absolute difference) in Table 2
. The third one doesn't.
Edit after data was provided:
Taking data definition from @MKR's post:
library(fuzzyjoin)
fuzzy_left_join(Table_1, Table_2,match_fun = function(x,y) y> x & y<=1.1*x )
# Joining by: "x"
# x.x y x.y z
# 1 1 25 1.10 30
# 2 2 26 2.05 35
# 3 3 27 NA NA
general explanations with fake data (first answer)
fake data
iris1 <- head(iris[1:2])
iris1
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
# 6 5.4 3.9
iris2 <- head(iris[c(1,3)])
set.seed(1)
# add noise
iris2$Sepal.Length <- iris2$Sepal.Length + rnorm(6,sd=0.05)
# shuffle rows
iris2 <- iris2[sample(seq(nrow(iris2))),]
iris2
# Sepal.Length Petal.Length
# 5 5.016475 1.4
# 2 4.909182 1.4
# 4 4.679764 1.5
# 6 5.358977 1.7
# 3 4.658219 1.3
# 1 5.068677 1.4
code
library(fuzzyjoin)
fuzzy_left_join(iris1,iris2,match_fun= function(x,y) y>0.99*x & y<1.01*x )
# Joining by: "Sepal.Length"
# Sepal.Length.x Sepal.Width Sepal.Length.y Petal.Length
# 1 5.1 3.5 5.068677 1.4
# 2 4.9 3.0 4.909182 1.4
# 3 4.7 3.2 4.679764 1.5
# 4 4.7 3.2 4.658219 1.3
# 5 4.6 3.1 NA NA
# 6 5.0 3.6 5.016475 1.4
# 7 5.4 3.9 5.358977 1.7
We see that some rows matched really well, let's take a look at the exceptions. The row number 4 had too much noise added in iris2
, so it got paired with row 3, which has 2 matches. As I chose a left join, row 4 is still shown, but with NAs
for iris2
's columns.
As I understand it:
- The joining columns will be expanded
- The function takes these long columns (
6*6==36
elements here) as arguments - We an apply vectorized functions (such as
<
or&
in this case) to return a vector of logical that will filter these long columns in order to build the outputdata.frame
.
distance_left_join
is more straightforward to use, but then it's absolute distance, not relative.
这篇关于在R中联接/匹配数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!