data.table: Lookup value and translate
Problem description
Like many, I am new to R. I have a large data set (500M+ rows) which I have read with fread into a data.table, logStats, which has data like the following:

    head(logStats,15)
                       time pid     mean
     1: 2014-03-10 00:00:00 998 3.570000
     2: 2014-03-10 00:00:00  11 4.090000
     3: 2014-03-10 00:00:00 345 3.380000
     4: 2014-03-10 00:05:00 998 4.866667
     5: 2014-03-10 00:05:00  11 3.677778
     6: 2014-03-10 00:05:00 345 4.487500
     7: 2014-03-10 00:10:00 345 4.833333
     8: 2014-03-10 00:10:00 998 4.333333
     9: 2014-03-10 00:10:00  11 6.977778
    10: 2014-03-10 00:15:00 345 3.900000
    11: 2014-03-10 00:15:00 998 3.200000
    12: 2014-03-10 00:15:00  11 6.030000
    13: 2014-03-10 00:20:00 998 4.550000
    14: 2014-03-10 00:20:00  11 4.030000
    15: 2014-03-10 00:20:00 345 6.060000
There is a second, very small data.table (360 rows) with two columns that decodes a 'pid' value into something a bit friendlier to read. The 'pid' value can be either numeric or character.
For example:

    pidLookupTable <- data.table(pid=c(998,11,345), pidName=c("Apple","Bannana","Cinnamon"))

which produces:

       pid  pidName
    1: 998    Apple
    2:  11  Bannana
    3: 345 Cinnamon
I want an expression that adds a column to the data.table logStats holding the pidName for each row's pid. I should get something like:

                       time pid     mean pidNames
     1: 2014-03-10 00:00:00 998 3.570000    Apple
     2: 2014-03-10 00:00:00  11 4.090000   Banana
     3: 2014-03-10 00:00:00 345 3.380000 Cinnamon
     4: 2014-03-10 00:05:00 998 4.866667    Apple
     5: 2014-03-10 00:05:00  11 3.677778   Banana
     6: 2014-03-10 00:05:00 345 4.487500 Cinnamon
     7: 2014-03-10 00:10:00 345 4.833333 Cinnamon
     8: 2014-03-10 00:10:00 998 4.333333    Apple
     9: 2014-03-10 00:10:00  11 6.977778   Banana
    10: 2014-03-10 00:15:00 345 3.900000 Cinnamon
    11: 2014-03-10 00:15:00 998 3.200000    Apple
    12: 2014-03-10 00:15:00  11 6.030000   Banana
    13: 2014-03-10 00:20:00 998 4.550000    Apple
    14: 2014-03-10 00:20:00  11 4.030000   Banana
    15: 2014-03-10 00:20:00 345 6.060000 Cinnamon
I wrote a function:

    pidNameLookup <- function(x) {
      return(pidLookupTable[pidLookupTable$pid == x, pidName])
    }

and then ran:

    logStats[, pidName := pidNameLookup(pid)]
But this only converts the first 3 rows and puts NA for the rest of the values:

    logStats[1:1000]
                date     time pid value           timestamp mean  pidName
       1: 10-03-2014 00:00:12 998   5.5 2014-03-10 00:00:12 3.57    Apple
       2: 10-03-2014 00:00:17  11   2.1 2014-03-10 00:00:17 4.09  Bannana
       3: 10-03-2014 00:00:22 345   5.7 2014-03-10 00:00:22 3.38 Cinnamon
       4: 10-03-2014 00:00:47 998   1.0 2014-03-10 00:00:47 3.57       NA
       5: 10-03-2014 00:00:55  11   0.3 2014-03-10 00:00:55 4.09       NA
      ---
     996: 10-03-2014 02:49:37 345   0.7 2014-03-10 02:49:37 5.30       NA
     997: 10-03-2014 02:50:01 998   9.9 2014-03-10 02:50:01 5.30       NA
     998: 10-03-2014 02:50:08  11   7.0 2014-03-10 02:50:08 7.00       NA
     999: 10-03-2014 02:50:18 345   2.4 2014-03-10 02:50:18 2.40       NA
    1000: 10-03-2014 02:50:48 998   0.7 2014-03-10 02:50:48 5.30       NA
and gives me the warning message:

    Warning message:
    In pidLookupTable$pid == x :
      longer object length is not a multiple of shorter object length
The warning message and the incorrect result mean that I am doing something completely wrong.
Help!! This is driving me mental.
Solution
I suggest you look at the introduction vignette for data.table (vignette("datatable-intro")), since this is something data.table is explicitly built for. This will give you exactly what you want, and should be much, much faster:

    setkey(logStats, "pid")
    setkey(pidLookupTable, "pid")
    logStats[pidLookupTable]
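
The original attempt fails because `==` compares the two `pid` vectors element by element instead of looking each value up. A join performs the lookup properly. As a minimal sketch, the code below uses a small made-up sample in place of the real `logStats` and adds the column in place with an update join. The sample rows, the `on =` argument (an alternative to `setkey` in recent data.table versions), and the `viaMatch` comparison are illustrative assumptions, not part of the original answer.

```r
library(data.table)

# Small made-up sample standing in for the real 500M-row logStats (assumption).
logStats <- data.table(
  time = as.POSIXct(c("2014-03-10 00:00:00", "2014-03-10 00:00:00",
                      "2014-03-10 00:00:00", "2014-03-10 00:05:00",
                      "2014-03-10 00:05:00"), tz = "UTC"),
  pid  = c(998, 11, 345, 998, 11),
  mean = c(3.57, 4.09, 3.38, 4.866667, 3.677778)
)

# Lookup table as defined in the question.
pidLookupTable <- data.table(
  pid     = c(998, 11, 345),
  pidName = c("Apple", "Bannana", "Cinnamon")
)

# Update join: match each logStats row to pidLookupTable on pid and add
# pidName by reference. The row order of logStats is left unchanged.
logStats[pidLookupTable, pidName := i.pidName, on = "pid"]

# The same lookup with base R's match(), shown only for comparison.
viaMatch <- pidLookupTable$pidName[match(logStats$pid, pidLookupTable$pid)]

print(logStats)
```

With this form, a `pid` that has no entry in the lookup table ends up with `NA` in `pidName`, whereas `logStats[pidLookupTable]` returns a new table arranged by the lookup table's rows. Both `pid` columns need to have the same type (both numeric or both character) for the join to work.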