R data.table: speeding up SI / metric prefix conversion
Question
So here's the situation. I've got an 85-million-row table with 18 columns. Three of these columns have values in metric prefix / SI notation (see "Metric prefix" on Wikipedia).
This means I have numbers like:
- .1M instead of 100000 or 1e+5, or
- 1K instead of 1000 or 1e+3
A sample of the data table:
           V1       V2    V3   V4   V5   V6  V7 V8 V9  V10 V11 V12 V13 V14 V15 V16 V17 V18
1: 2014-03-25 12:15:12 58300 3010 44.0  4.5 0.0  0  0  0.8  50 0.8 10K 303 21K   0   a  56
2: 2014-03-25 12:15:12 56328 3010 28.0 12.0 0.0  0  0  0.3  60 0.0  59  62 .1M   0   a  66
3: 2014-03-25 12:15:12 21082 3010 10.0  1.7 0.0  0  0 14.0  72 0.3  4K 208  8K   1   a  80
4: 2014-03-25 12:15:12 59423 3010 12.0  0.0 0.2  0  0 88.0   0 0.0  20  16  71   0   a  26
5: 2014-03-25 12:15:12 59423 3010  9.6  1.4 0.0  0  0 60.0  29 0.2  2K 251  6K   0   a  56
6: 2014-03-25 12:15:12 24193 3010  8.3  1.9 0.0  0  0  9.9  80 0.3  3K 264  8K   1   a  71
7: 2014-03-25 12:15:12 21082 3010  7.1  1.7 0.4  0  0  6.3  83 0.3  3K 197  7K   0   a  71
8: 2014-03-25 12:15:12 59423 3010  4.6  1.2 0.0  0  0 57.0  37 0.1 998  81  7K   0   a 118
I modified a function written by Hans-Jörg Bibiko, who used it to modify ggplot2 scales (see his website if you are interested). The function I ended up using is:
sitor <- function(x)
{
  conv <- paste("E", c(seq(-24, -3, by=3), -2, -1, 0, seq(3, 24, by=3)), sep="")
  names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","","K","M","G","T","P","E","Z","Y")
  x <- as.character(x)
  num <- function(x) as.numeric(
    paste(
      strsplit(x,"[A-z|µ]")[[1]][1],
      ifelse(substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1) == "",
             "",
             conv[substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1)]
      ),
      sep=""
    )
  )
  return(lapply(x,num))
}
I apply it to my data table to update three columns, like this:
temp[ ,`:=`(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ]
I have set the key using
setkeyv(temp,c("V1","V2","V3","V18"))
Anyway, 61 minutes later I am still here waiting for a result... Some tips on how to speed up this conversion would be really handy, given that my data size is about to grow 4 to 5 times.
Recommended answer
Why not try the sitools library?
library(data.table)
dt <- data.table(var = sample(x = 1:1e5, size = 1e6, replace = TRUE))
library(sitools)
> system.time(dt[, var2 := f2si(var)])
   user  system elapsed
  10.08    0.09   10.89
This is a data.table-based function that reverses f2si from the sitools package:
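si2f <- function(x) {
  if (is.numeric(x)) return(x)
  require(data.table)
  # lookup table of SI prefixes and their multipliers, keyed by prefix
  dt <- data.table(lab = c("y","z","a","f","p","n","µ","m","c","d","", "da", "h", "k","M","G","T","P","E","Z","Y"),
                   mul = c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3, 1e6, 1e9, 1e12, 1e15, 1e18, 1e21, 1e24),
                   key = "lab")
  # numeric part: drop everything except digits and the decimal point
  res <- as.numeric(gsub("[^0-9.]", "", x))
  # prefix part: drop digits, whitespace and the decimal point
  x <- gsub("[0-9]|\\s+|\\.", "", x)
  # keyed join returns one multiplier per element of x, in the original order
  .subset2(dt[.(x)], "mul") * res
}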
> system.time(dt[, var3 := si2f(var2)])
   user  system elapsed
  13.18    0.03   13.31
> dt[, all.equal(var, var3)]
[1] TRUE
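Note that the prefix table in si2f lists a lowercase "k", while the question's data uses an uppercase "K" (for example 10K), so values like that would likely come back as NA unless "K" is added. As a rough sketch only, and not part of the original answer, the same idea can be written with a named lookup vector in base R. The names si_multipliers and si_to_numeric are hypothetical, and the code assumes each value is a plain decimal number followed by at most one prefix (no signs, no scientific notation).

```r
# Hypothetical helper, not from the original answer: a vectorised SI-string parser.
# Assumption: values look like "10K", ".1M", "998" or "1 k".
si_multipliers <- setNames(
  c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e-2, 1e-1,
    10, 1e2, 1e3, 1e3, 1e6, 1e9, 1e12, 1e15, 1e18, 1e21, 1e24),
  c("y", "z", "a", "f", "p", "n", "\u00b5", "m", "c", "d",
    "da", "h", "k", "K", "M", "G", "T", "P", "E", "Z", "Y")
)

si_to_numeric <- function(x) {
  if (is.numeric(x)) return(x)                      # already numeric: nothing to do
  x <- trimws(as.character(x))
  number <- as.numeric(sub("[^0-9.]+$", "", x))     # strip the trailing prefix
  prefix <- trimws(sub("^[0-9.]+", "", x))          # strip the leading number
  mult <- unname(si_multipliers[prefix])            # unknown prefixes give NA
  mult[which(prefix == "")] <- 1                    # no prefix means multiply by 1
  number * mult
}

si_to_numeric(c("10K", ".1M", "998"))

# Possible use on the question's table (untested on the real 85M-row data):
# cols <- c("V13", "V14", "V15")
# temp[, (cols) := lapply(.SD, si_to_numeric), .SDcols = cols]
```

Because every step works on the whole column at once, this avoids the per-element lapply and repeated strsplit calls in sitor, and it returns a plain numeric vector rather than a list.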