Find closest match for values in one column, if additional columns match 100%

Problem description

I have two data frames of unequal lengths. I would like to add value2 from df2 to df1 where the difference between value1a and value1b is minimal. Not all rows in df1 will receive a match from df2, because df2 is shorter. st, year and doy must match between df1 and df2 before the closest match (between value1a and value1b) is sought, as the dataset comprises multiple samplings over several years at the same stations.

st <- c("ST1", "ST2", "ST2", "ST2", "ST3")
year <- c(2011, 2011, 2012, 2012, 2013)
doy <- c(20,29,4,4,20)

value1a <- c(200, 250, 240, 250, 260)
value1b <- c(201, 258, 240, 251, 180)
value2 <- c(5,6,7,8.5,10)

df1 <- data.frame(st, year, doy, value1a)
df2 <- data.frame(st, year, doy, value1b, value2); df2 <- df2[1:4,]

I shortened df2 to keep the same column names while showing that it may not have the same number of rows or the same information as df1. Ideally, the output would look something like this:

   st year doy value1a value1b value2
1 ST1 2011  20     200     201    5.0
2 ST2 2011  29     250     258    6.0
3 ST2 2012   4     240     240    7.0
4 ST2 2012   4     250     251    8.5
5 ST3 2013  20     260      NA     NA

I have looked into closest.match() and related functions, but have not been able to get them working when prior matches in other columns must be successful. Any ideas?

Edit: I changed the data frames according to the suggestions below and hope that clarifies the problem. I apologize to everyone who already put in the effort to answer!

For practical illustration, in case oceanographers are around: I am working with water column data. At the same station on the same day of a year, certain data are obtained during the first sampling according to value1a over depth in the water. This is then repeated on the same day only very slightly later (hence year and doy match), but the automatically recorded value1b is slightly different. I would like to match value2, obtained only on the second sampling, with the data obtained on the first sampling.

Solution

Recommended answer

Unpacking the statement "the difference between value1a and value1b is minimal", this is a two-step operation: merge/join, then reduce based on that difference.

Because the current sample data doesn't test this requirement, I'll duplicate one of the rows with slightly different values to show that the minimum is being chosen.

df2 <- rbind(df2, transform(df2[2,], value1b = 300, value2 = 6.1))
df2
#     st year doy2 value1b value2
# 1  ST1 2011   21     201    5.0
# 2  ST2 2011   29     258    6.0
# 3  ST2 2012    4     240    7.0
# 4  ST2 2013   20     180    8.5
# 21 ST2 2011   29     300    6.1

Also, because I am not certain that the df1 rows are perfectly unique with respect to st, year, and doy, I'll add an rn (row number) field in each step so that I can be certain to reduce correctly. (That is, if I don't do this and those three fields are not unique, then I will over-reduce your data.)
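
To make the over-reduction concrete, here is a minimal base R sketch. It assumes the edited data from the question (both frames name the day column doy) and uses only the two df1 rows that share st, year and doy; the names by_key and by_rn are illustrative, not from the original answer.

```r
# Two df1 rows with identical st/year/doy, as in the question's edited data
df1 <- data.frame(st = c("ST2", "ST2"), year = c(2012, 2012),
                  doy = c(4, 4), value1a = c(240, 250))
df2 <- data.frame(st = c("ST2", "ST2"), year = c(2012, 2012),
                  doy = c(4, 4), value1b = c(240, 251), value2 = c(7, 8.5))

df1$rn <- seq_len(nrow(df1))                  # row counter for df1

# Step 1: join on the exact-match columns (every df1 row pairs with every df2 row here)
joined <- merge(df1, df2, by = c("st", "year", "doy"))

# Step 2: sort by the absolute difference, then keep the first row per group
joined <- joined[order(abs(joined$value1a - joined$value1b)), ]

by_key <- joined[!duplicated(joined[c("st", "year", "doy")]), ]  # reduces too far
by_rn  <- joined[!duplicated(joined$rn), ]                       # one row per df1 row

nrow(by_key)
nrow(by_rn)
```

Reducing by the three key columns collapses both df1 rows into one, whereas reducing by rn keeps one closest match for each original row.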

dplyr

library(dplyr)
df1 %>%
  mutate(rn = row_number()) %>%
  left_join(., df2, by = c("st", "year", doy = "doy2")) %>%
  arrange(abs(value1a - value1b)) %>%
  group_by(rn) %>%
  slice(1) %>%
  ungroup()
# # A tibble: 4 x 7
#   st     year   doy value1a    rn value1b value2
#   <chr> <dbl> <dbl>   <dbl> <int>   <dbl>  <dbl>
# 1 ST1    2011    20     200     1      NA   NA
# 2 ST2    2011    29     250     2     258    6
# 3 ST2    2012     4     240     3     240    7
# 4 ST2    2013    20     260     4     180    8.5
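
The same two steps (join, then reduce) can also be sketched in base R without any packages. This is only an illustration: it assumes the edited data from the question, where both frames name the day column doy, and the names joined, diff and closest are made up for the example.

```r
# Data as given in the (edited) question
st      <- c("ST1", "ST2", "ST2", "ST2", "ST3")
year    <- c(2011, 2011, 2012, 2012, 2013)
doy     <- c(20, 29, 4, 4, 20)
value1a <- c(200, 250, 240, 250, 260)
value1b <- c(201, 258, 240, 251, 180)
value2  <- c(5, 6, 7, 8.5, 10)

df1 <- data.frame(st, year, doy, value1a)
df2 <- data.frame(st, year, doy, value1b, value2)[1:4, ]

df1$rn <- seq_len(nrow(df1))                  # row counter, so df1 rows stay distinct

# Step 1: left join on the columns that must match exactly
joined <- merge(df1, df2, by = c("st", "year", "doy"), all.x = TRUE)

# Step 2: within each df1 row, keep the candidate with the smallest difference
joined$diff <- abs(joined$value1a - joined$value1b)
joined  <- joined[order(joined$rn, joined$diff), ]   # NA differences sort last
closest <- joined[!duplicated(joined$rn), ]

closest$rn <- NULL
closest$diff <- NULL
rownames(closest) <- NULL
closest
```

With this data, the ST3 row has no partner in df2 and keeps NA for value1b and value2, as in the desired output from the question.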


data.table

Some notes:

  • I'm using magrittr's pipe (%>%) purely for demonstration, as I think it breaks things out visually very well; it is not required
  • the way data.table merges things, the doy field retains the name of the right table (df2), so I rename it with data.table::setnames
  • a small feature/bug in data.table prevents first-time printing of a result when the last [-operation is an assignment with :=; this is FAQ 2.23, and is both aesthetic-only and not a factor if there's another operation; because of this, I added a seemingly empty %>% .[] to print it to the console here

library(data.table)
library(magrittr)
setDT(df1)
setDT(df2)
df1[, rn := .I] %>%
  df2[., on = .(st, year, doy2 = doy)] %>%
  .[ order(abs(value1a - value1b)), ] %>%
  .[, .SD[1,], by = .(rn) ] %>%
  .[, rn := NULL] %>%
  setnames(., old = "doy2", new = "doy") %>%
  .[]
#     st year doy value1b value2 value1a
# 1: ST2 2012   4     240    7.0     240
# 2: ST2 2011  29     258    6.0     250
# 3: ST2 2013  20     180    8.5     260
# 4: ST1 2011  20      NA     NA     200

(There are ways to tell data.table what to do when multiple matches are found. I did not test to see if order can be assured, but if there's a way to know that with certainty, then perhaps this can be simplified a little.)
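
One possible simplification, offered as a different technique rather than the method above, is a rolling join: data.table's roll = "nearest" matches the last join column to its nearest value while the earlier join columns must match exactly. The sketch below assumes the edited data from the question (day column named doy in both tables); dt1, dt2, res and the helper column value1b_key are names introduced for illustration.

```r
library(data.table)

dt1 <- data.table(st = c("ST1", "ST2", "ST2", "ST2", "ST3"),
                  year = c(2011, 2011, 2012, 2012, 2013),
                  doy = c(20, 29, 4, 4, 20),
                  value1a = c(200, 250, 240, 250, 260))
dt2 <- data.table(st = c("ST1", "ST2", "ST2", "ST2"),
                  year = c(2011, 2011, 2012, 2012),
                  doy = c(20, 29, 4, 4),
                  value1b = c(201, 258, 240, 251),
                  value2 = c(5, 6, 7, 8.5))

# Join on a copy of value1b, so the matched value1b itself survives in the result
dt2[, value1b_key := value1b]

# st, year, doy match exactly; the last join column is rolled to the nearest value
res <- dt2[dt1, on = .(st, year, doy, value1b_key = value1a), roll = "nearest"]

# After the join, the key column carries dt1's value1a, so give it that name back
setnames(res, "value1b_key", "value1a")
res[]
```

The result has one row per dt1 row, so no rn counter or separate reduction step is needed; rows without a partner in dt2 get NA.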

sqldf

(Using the original, non-data.table versions of df1 and df2.)

This creates a couple of row-number counters, similar to rn in the data.table solution, for similar reasons.

sqldf::sqldf("
  with t1 as (
    select df1.*, row_number() over () as rn1
    from df1
  ),
  t12 as (
    select t1.*, t2.value2,
      row_number() over (partition by t1.rn1) as rn2
    from t1
      left join df2 t2 on t1.st=t2.st and t1.year=t2.year and t1.doy=t2.doy2
    order by abs(value1a - value1b)
  )
  select st, year, doy, value1a, value2 from t12 where rn2 = 1")
#    st year doy value1a value2
# 1 ST1 2011  20     200     NA
# 2 ST2 2012   4     240    7.0
# 3 ST2 2011  29     250    6.0
# 4 ST2 2013  20     260    8.5
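
A cautious variant: in SQL, an ORDER BY on the query does not by itself determine the order in which row_number() numbers the rows inside a window, so it can be safer to put the ordering inside the OVER clause. The sketch below does that. It assumes the edited data from the question (day column named doy in both frames) and, like the query above, an SQLite build with window-function support; res is an illustrative name.

```r
library(sqldf)

df1 <- data.frame(st = c("ST1", "ST2", "ST2", "ST2", "ST3"),
                  year = c(2011, 2011, 2012, 2012, 2013),
                  doy = c(20, 29, 4, 4, 20),
                  value1a = c(200, 250, 240, 250, 260))
df2 <- data.frame(st = c("ST1", "ST2", "ST2", "ST2"),
                  year = c(2011, 2011, 2012, 2012),
                  doy = c(20, 29, 4, 4),
                  value1b = c(201, 258, 240, 251),
                  value2 = c(5, 6, 7, 8.5))

res <- sqldf("
  with t1 as (
    select df1.*, row_number() over () as rn1
    from df1
  ),
  t12 as (
    select t1.*, t2.value1b, t2.value2,
      -- rank the candidates for each df1 row by closeness
      row_number() over (
        partition by t1.rn1
        order by abs(t1.value1a - t2.value1b)
      ) as rn2
    from t1
      left join df2 t2
        on t1.st = t2.st and t1.year = t2.year and t1.doy = t2.doy
  )
  select st, year, doy, value1a, value1b, value2
  from t12
  where rn2 = 1
  order by rn1")
res
```

Each df1 row keeps only its closest candidate (rn2 = 1), and unmatched rows come through with NA from the left join.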
