根据不同数据框中的2个键定义数据框中的列的值 [英] Define the value of a column in a dataframe based on 2 keys from a different dataframe

查看:33
本文介绍了根据不同数据框中的2个键定义数据框中的列的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下数据框:

a <- seq(0, 5, by = 0.25)
b <-seq(0, 20, by = 1)
df <- data.frame(a, b)

,我想基于a和b列以及下面的转换表创建一个新的值列:

and I'd like to create a new column "value", based on columns a and b, and the conversion table below:

a_min <- c(0,2, 0,2)
a_max <- c(2,5,2,5)
b_min <- c(0,0,10,10)
b_max <- c(10,10,30,30)
output <-c(1,2,3,4)

conv <- data.frame(a_min, a_max, b_min, b_max, output)

我尝试使用dplyr :: mutate进行操作,但没有太大的成功...

I've tried to do it using dplyr::mutate without much success...

require(dplyr)
mutate(df, value = calcula(conv, a, b))

更长的对象长度不是较短对象长度的倍数

我的期望是获得像上面的'df'这样的数据框,并附加如下所示的列值:

My expectation would be to obtain a dataframe like the 'df' above with the additional column value as per below:

df$value <- c(rep(1,8), rep(2,2), rep(4,11))


推荐答案

这是矩阵的另一种选择。

One more option, this time with matrices.

with(df, with(conv, output[max.col(
    outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
    outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))

## [1] 1 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4

外部比较 df 中向量的每个元素,均来自 conv 中向量的每个元素,从而为每个调用生成布尔矩阵。由于 TRUE 为1,如果将所有四个矩阵相加,则所需索引将是 TRUE 最多的列s,您可以通过 max.col 获得。将子集输出进行分组,就可以得到结果。

outer compares each element of the vector from df from the one from conv, producing a matrix of Booleans for each call. Since TRUE is 1, if you add all four matrices, the index you want will be the column with the most TRUEs, which you can get with max.col. Subset output, and you've got your result.

使用矩阵的好处是它们 fast 。在1000行上使用@Phann的基准测试:

The benefit of working with matrices is that they're fast. Using @Phann's benchmarks on 1,000 rows:

Unit: microseconds
      expr       min         lq       mean     median         uq       max neval   cld
 alistaire   276.099   320.4565   349.1045   339.8375   357.2705   941.551   100 a    
      akr1   830.934   966.6705  1064.8433  1057.6610  1152.3565  1507.180   100 ab   
      akr2 11431.246 11731.3125 12835.5229 11947.5775 12408.4715 36767.488   100    d 
       Pha 11985.129 12403.1095 13330.1465 12660.4050 13044.9330 29653.842   100    d 
       Ron 71132.626 74300.3540 81136.9408 78034.2275 88952.8765 98950.061   100     e
      Dav1  2506.205  2765.4095  2971.6738  2948.6025  3082.4025  4065.368   100   c  
      Dav2  2104.481  2272.9180  2480.9570  2478.8775  2575.8740  3683.896   100  bc  

并在100,000行上:

and on 100,000 rows:

Unit: milliseconds
      expr      min       lq     mean   median       uq       max neval cld
 alistaire 30.00677 36.49348 44.28828 39.43293 54.28207  64.36581   100 a  
      akr1 36.24467 40.04644 48.46986 41.59644 60.15175  77.34415   100 a  
      Dav1 51.74218 57.23488 67.70289 64.11002 68.86208 382.25182   100   c
      Dav2 48.48227 54.82818 60.25256 59.81041 64.92611  91.20212   100  b 

这篇关于根据不同数据框中的2个键定义数据框中的列的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆