R replacing columns by lookup to dictionary

Question

In this question I need to look up a replacement value for a data frame column based not on a single attribute, but on several attributes, including a range comparison against a dictionary. (Yes, this is a continuation of the question "R conditional replace more columns by lookup".)

This should be an easy question for people who know R well, because I already have a working solution for basic indexing that only needs to be extended, possibly easily. It is very hard for me, though, because I am still learning R.

Where I am starting from:

When I want to replace missing values in the columns testcolnames of the (big) table df1 with the column default of the (small) dictionary testdefs (the row is selected by matching testdefs$LABMET_ID to the column name from testcolnames), I use this code:

testcolnames <- c("80", "116")  # ...result of a regexp on colnames(df1), originally much longer

df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  # fill NAs with the default from the dictionary row whose LABMET_ID matches the column name
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)]
  tmpcol
})

Where I want to go:

Now I need to extend this solution. The table testdefs will have (example below) multiple rows with the same LABMET_ID, differing only in two new columns called lower and upper. These are the bounds to compare df1$rngvalue against when selecting which value to use as the replacement.

In other words, the solution should not only select the rows of testdefs where testdefs$LABMET_ID equals the column name, but pick from those rows the one where df1$rngvalue lies between testdefs$lower and testdefs$upper. If no such row exists, take the closest range (either the lowest or the highest). If the dictionary has no matching LABMET_ID at all, the NA can stay in the original data.

An example:

testdefs

"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2            #..."many columns we don't care about"
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02

df1:

"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6

to be transformed into:

"rngvalue","80","116"
36,0.03,0.135                   #col80 is always replaced by 0.03
600000,0.03,0.105               #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11                      #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6                       #6 not replaced
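
To make the example reproducible, the two tables can be built directly in R. This is a minimal sketch that copies the values from the listings above. The one non-obvious detail is check.names = FALSE, which is needed so that data.frame keeps the numeric-looking column names "80" and "116" instead of turning them into X80 and X116.

```r
# Dictionary: one or more rows per LABMET_ID, each covering a [lower, upper] range
testdefs <- data.frame(
  LABMET_ID = c(30, 46, 80, 116, 116, 116, 116),
  lower     = c(0, 0, 0, 0, 31, 366, 5476),
  upper     = c(54750, 54750, 54750, 30, 365, 5475, 54750),
  default   = c(25, 1.45, 0.03, 0.09, 0.135, 0.11, 0.105),
  notuse    = c(80, 3.5, 0.1, 0.5, 0.7, 0.7, 0.7),
  notuse2   = c(2, 0.2, 0.01, 0.01, 0.01, 0.01, 0.02)
)

# Data: check.names = FALSE preserves the column names "80" and "116"
df1 <- data.frame(
  rngvalue = c(36, 600000, 367, 90),
  "80"     = c(NA, NA, 5, NA),
  "116"    = c(NA, NA, NA, 6),
  check.names = FALSE
)

str(df1)
```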

Solution

Since the intervals don't have gaps, you can use findInterval. I would first convert the lookup table to a list that holds the break points and the defaults for each LABMET_ID, using dlply from plyr.

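## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
    # interleave lower/upper, append the last upper, then keep every other element:
    # this yields all lower bounds followed by the final upper bound
    list(breaks=c(rbind(x$lower, x$upper), x$upper[length(x$upper)])[c(TRUE,FALSE)],
         default=x$default))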

Each lookup entry now looks like this:

lookup[["116"]]
# $breaks
# [1]     0    31   366  5476 54750
# 
# $default
# [1] 0.090 0.135 0.110 0.105
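
If you would rather not depend on plyr, a list of the same shape can be built with base R's split and lapply. This is a sketch rather than part of the original answer: the name lookup_base is made up for illustration, the dictionary is cut down to the columns that matter, and the rows are sorted by lower first, on the assumption that the dictionary may not arrive in order.

```r
# Reduced copy of the dictionary, only the columns used for the lookup
testdefs <- data.frame(
  LABMET_ID = c(80, 116, 116, 116, 116),
  lower     = c(0, 0, 31, 366, 5476),
  upper     = c(54750, 30, 365, 5475, 54750),
  default   = c(0.03, 0.09, 0.135, 0.11, 0.105)
)

# split() groups the rows by LABMET_ID and names the list elements "80", "116"
lookup_base <- lapply(split(testdefs, testdefs$LABMET_ID), function(x) {
  x <- x[order(x$lower), ]                      # make sure the ranges are ascending
  list(breaks  = c(x$lower, x$upper[nrow(x)]),  # all lower bounds + final upper bound
       default = x$default)
})

lookup_base[["116"]]$breaks
```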

Then you can do the lookup with the following:

testcolnames=c("80","116")

df1[,testcolnames] <- lapply(testcolnames, function(x) {
    tmpcol <- df1[,x]
    defaults <- with(lookup[[x]], {
        default[pmax(pmin(length(breaks)-1, findInterval(df1$rngvalue, breaks)), 1)]
    })
    tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
    tmpcol
})

#   rngvalue   80   116
# 1       36 0.03 0.135
# 2   600000 0.03 0.105
# 3      367 5.00 0.110
# 4       90 0.03 6.000
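
The question also asks that a column whose name has no LABMET_ID in the dictionary simply keeps its NAs. The code above does not cover that case, because lookup[[x]] is NULL for such a column. One possible way to handle it is sketched below. The helper fill_from_lookup and the column "999" are hypothetical and exist only for illustration, and the lookup list is written out by hand so the block runs on its own.

```r
# Hypothetical helper: fill the NAs of one column from one lookup entry.
# If the dictionary has no entry (NULL), the column is returned unchanged.
fill_from_lookup <- function(col, rng, entry) {
  if (is.null(entry)) return(col)
  idx <- findInterval(rng, entry$breaks)
  idx <- pmax(pmin(length(entry$breaks) - 1, idx), 1)  # clamp to a valid interval
  col[is.na(col)] <- entry$default[idx][is.na(col)]
  col
}

# Hand-written lookup entry with the same shape as the list built above
lookup <- list("116" = list(breaks  = c(0, 31, 366, 5476, 54750),
                            default = c(0.09, 0.135, 0.11, 0.105)))

df1 <- data.frame(rngvalue = c(36, 600000, 367, 90),
                  "116"    = c(NA, NA, NA, 6),
                  "999"    = c(NA, 1, NA, NA),   # no dictionary entry for this one
                  check.names = FALSE)

cols <- c("116", "999")
df1[cols] <- lapply(cols, function(x) {
  fill_from_lookup(df1[[x]], df1$rngvalue, lookup[[x]])
})
df1
```

Column "116" is filled by range as before, while column "999" is left as it was.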

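findInterval returns 0 for values below the first break and length(breaks) for values at or above the last break, so for an rngvalue outside the covered range the result would not be a valid index into default. That is the reason for the pmin and pmax in the code above: they clamp the index to the first or last interval.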
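
The clamping is easiest to see on a small vector. The sketch below uses the breaks for LABMET_ID 116 and a few made-up rngvalue inputs (the -5 and the exact 54750 are not from the question; they are there to hit both edges).

```r
breaks   <- c(0, 31, 366, 5476, 54750)
rngvalue <- c(-5, 36, 367, 54750, 600000)

# Raw interval index: 0 below the first break, length(breaks) at or above the last
raw <- findInterval(rngvalue, breaks)
raw
# [1] 0 2 3 5 5

# Clamp into 1 .. length(breaks) - 1 so it can index the vector of defaults
clamped <- pmax(pmin(length(breaks) - 1, raw), 1)
clamped
# [1] 1 2 3 4 4
```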
