R replacing columns by lookup to dictionary
Problem description
In this question I need to look up a replacement value for a data frame column based not on a single attribute, but on several attributes plus a range comparison against a dictionary. (Yes, this is a continuation of the story in "R conditional replace more columns by lookup".)
It should be an easy question for people who know R, because I provide a working solution for the basic indexing case that only needs to be extended, possibly easily. It is hard for me, though, because I am still learning R.
Where I am starting from:
When I want to replace missing values in the columns testcolnames of the (big) table df1 with the column default of the (small) dictionary testdefs (the row is selected by matching testdefs$LABMET_ID to the column name from testcolnames), I use this code:
testcolnames <- c("80", "116")  # ...result of a regexp on colnames(df1), originally much longer
df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)]
  tmpcol
})
Where I want to go:
Now I need to extend this solution. The table testdefs will have (example below) multiple rows with the same LABMET_ID, differing only in two new columns called lower and upper. These are the bounds for the variable df1$rngvalue that decide which default value is used as the replacement.
In other words, the solution should not only select the rows of testdefs where testdefs$LABMET_ID equals the column name, but also pick from those rows the one where df1$rngvalue lies between testdefs$lower and testdefs$upper. If no such row exists, take the closest range (either the lowest or the highest). If the dictionary doesn't contain the LABMET_ID at all, the NA can stay in the original data.
An example:
testdefs:
"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2 #...plus many columns we don't care about
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02
df1:
"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6
to be transformed into:
"rngvalue","80","116"
36,0.03,0.135 #col80 is always replaced by 0.03
600000,0.03,0.105 #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11 #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6 #6 not replaced
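To experiment with the example, the two tables can be built directly in R. This is only a sketch of the sample data shown above. It reads the CSV lines from character vectors and assumes that check.names = FALSE is wanted for df1, so that the numeric column names "80" and "116" are kept instead of being turned into "X80" and "X116".

```r
# Dictionary: one row per (LABMET_ID, range); the notuse columns are ignored later.
testdefs <- read.csv(text = c(
  '"LABMET_ID","lower","upper","default","notuse","notuse2"',
  '30,0,54750,25,80,2',
  '46,0,54750,1.45,3.5,0.2',
  '80,0,54750,0.03,0.1,0.01',
  '116,0,30,0.09,0.5,0.01',
  '116,31,365,0.135,0.7,0.01',
  '116,366,5475,0.11,0.7,0.01',
  '116,5476,54750,0.105,0.7,0.02'
))

# check.names = FALSE keeps "80" and "116" as column names.
df1 <- read.csv(text = c(
  '"rngvalue","80","116"',
  '36,NA,NA',
  '600000,NA,NA',
  '367,5,NA',
  '90,NA,6'
), check.names = FALSE)

str(df1)
```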
Since the intervals don't have gaps, you can use findInterval. I would change the lookup table to a list containing the break points and defaults for each LABMET_ID, using dlply from plyr.
## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
  # breaks = every lower bound followed by the last upper bound
  list(breaks = c(rbind(x$lower, x$upper), x$upper[length(x$upper)])[c(TRUE, FALSE)],
       default = x$default))
Each entry of the lookup now looks like this:
lookup[["116"]]
# $breaks
# [1] 0 31 366 5476 54750
#
# $default
# [1] 0.090 0.135 0.110 0.105
Then you can do the lookup with the following:
testcolnames <- c("80", "116")
df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  # default for every row, chosen by the interval that rngvalue falls into
  defaults <- with(lookup[[x]], {
    default[pmax(pmin(length(breaks) - 1, findInterval(df1$rngvalue, breaks)), 1)]
  })
  tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
  tmpcol
})
# rngvalue 80 116
# 1 36 0.03 0.135
# 2 600000 0.03 0.105
# 3 367 5.00 0.110
# 4 90 0.03 6.000
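The question also asks that a column whose ID is missing from the dictionary keeps its NA values. The following is a minimal base R sketch of one way to handle that case explicitly, not the answer's own code. The helper fill_defaults and the extra column "999" are hypothetical names introduced for illustration. It assumes the ranges of each ID have no gaps, so the lower bounds alone can serve as breaks.

```r
testdefs <- data.frame(
  LABMET_ID = c(80, 116, 116, 116, 116),
  lower     = c(0, 0, 31, 366, 5476),
  upper     = c(54750, 30, 365, 5475, 54750),
  default   = c(0.03, 0.09, 0.135, 0.11, 0.105)
)
df1 <- data.frame(
  rngvalue = c(36, 600000, 367, 90),
  `80`     = c(NA, NA, 5, NA),
  `116`    = c(NA, NA, NA, 6),
  `999`    = c(NA, 1, NA, NA),  # hypothetical column with no dictionary entry
  check.names = FALSE
)

# Hypothetical helper: fill NAs in `cols` from the matching dictionary rows.
fill_defaults <- function(df, defs, cols, range_col = "rngvalue") {
  for (col in cols) {
    rows <- defs[as.character(defs$LABMET_ID) == col, , drop = FALSE]
    if (nrow(rows) == 0) next          # no dictionary entry: leave NAs alone
    rows <- rows[order(rows$lower), ]  # findInterval needs increasing breaks
    # Values at or above the last lower bound fall into the last range;
    # pmax() moves values below the first lower bound into the first range.
    idx <- pmax(findInterval(df[[range_col]], rows$lower), 1)
    missing <- is.na(df[[col]])
    df[[col]][missing] <- rows$default[idx[missing]]
  }
  df
}

result <- fill_defaults(df1, testdefs, c("80", "116", "999"))
result
```

Using only the lower bounds means a value above the highest upper bound automatically lands in the last range, so only the lower end needs clamping here.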
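findInterval returns 0 when rngvalue is below the first break, and the number of breaks when it is at or above the last break. Neither is a valid index into default, which is the reason for the pmin and pmax in the code above: they clamp the result to the first or last interval.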
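A small illustration of that clamping, using the breaks for LABMET_ID 116 from above. The value -5 is a hypothetical below-range input added only to show the lower end.

```r
breaks <- c(0, 31, 366, 5476, 54750)  # breaks for LABMET_ID 116
x <- c(-5, 36, 367, 600000)           # -5 is a hypothetical below-range value

raw <- findInterval(x, breaks)
raw
# [1] 0 2 3 5

# Clamp into the valid index range 1 .. length(breaks) - 1
clamped <- pmax(pmin(length(breaks) - 1, raw), 1)
clamped
# [1] 1 2 3 4
```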