data.table: look up a value and translate it

Problem description

Like many, I am new to R. I have a large data set (500M+ rows) which I have read with fread into a data.table, logStats, that contains data like the following:

 head(logStats,15)

                   time pid     mean
 1: 2014-03-10 00:00:00 998 3.570000
 2: 2014-03-10 00:00:00  11 4.090000
 3: 2014-03-10 00:00:00 345 3.380000
 4: 2014-03-10 00:05:00 998 4.866667
 5: 2014-03-10 00:05:00  11 3.677778
 6: 2014-03-10 00:05:00 345 4.487500
 7: 2014-03-10 00:10:00 345 4.833333
 8: 2014-03-10 00:10:00 998 4.333333
 9: 2014-03-10 00:10:00  11 6.977778
10: 2014-03-10 00:15:00 345 3.900000
11: 2014-03-10 00:15:00 998 3.200000
12: 2014-03-10 00:15:00  11 6.030000
13: 2014-03-10 00:20:00 998 4.550000
14: 2014-03-10 00:20:00  11 4.030000
15: 2014-03-10 00:20:00 345 6.060000

There is a second, very small data.table (360 rows) with two columns that decode a 'pid' value into something a bit friendlier to read. The 'pid' value can be either numeric or character.

For example:

pidLookupTable <- data.table(pid = c(998, 11, 345), pidName = c("Apple", "Bannana", "Cinnamon"))

which produces:

   pid  pidName
1: 998    Apple
2:  11  Bannana
3: 345 Cinnamon

I want an expression that adds a column to the data.table logStats holding the pidName that corresponds to each row's pid.

I should get something like:

                   time pid     mean pidNames
 1: 2014-03-10 00:00:00 998 3.570000    Apple
 2: 2014-03-10 00:00:00  11 4.090000   Banana
 3: 2014-03-10 00:00:00 345 3.380000 Cinnamon
 4: 2014-03-10 00:05:00 998 4.866667    Apple
 5: 2014-03-10 00:05:00  11 3.677778   Banana
 6: 2014-03-10 00:05:00 345 4.487500 Cinnamon
 7: 2014-03-10 00:10:00 345 4.833333 Cinnamon
 8: 2014-03-10 00:10:00 998 4.333333    Apple
 9: 2014-03-10 00:10:00  11 6.977778   Banana
10: 2014-03-10 00:15:00 345 3.900000 Cinnamon
11: 2014-03-10 00:15:00 998 3.200000    Apple
12: 2014-03-10 00:15:00  11 6.030000   Banana
13: 2014-03-10 00:20:00 998 4.550000    Apple
14: 2014-03-10 00:20:00  11 4.030000   Banana
15: 2014-03-10 00:20:00 345 6.060000 Cinnamon

I wrote a function:

pidNameLookup<-function(x) { 
  return(pidLookupTable[pidLookupTable$pid==x,pidName]) 
}

and then ran:

logStats[,pidName:=pidNameLookup(pid)]

But this only converts the first 3 rows and puts NA for the rest of the values:

   logStats[1:1000]
               date     time pid value           timestamp mean  pidName
      1: 10-03-2014 00:00:12 998   5.5 2014-03-10 00:00:12 3.57    Apple
      2: 10-03-2014 00:00:17  11   2.1 2014-03-10 00:00:17 4.09  Bannana
      3: 10-03-2014 00:00:22 345   5.7 2014-03-10 00:00:22 3.38 Cinnamon
      4: 10-03-2014 00:00:47 998   1.0 2014-03-10 00:00:47 3.57       NA
      5: 10-03-2014 00:00:55  11   0.3 2014-03-10 00:00:55 4.09       NA
      ---                                                                
      996: 10-03-2014 02:49:37 345   0.7 2014-03-10 02:49:37 5.30       NA
      997: 10-03-2014 02:50:01 998   9.9 2014-03-10 02:50:01 5.30       NA
      998: 10-03-2014 02:50:08  11   7.0 2014-03-10 02:50:08 7.00       NA
      999: 10-03-2014 02:50:18 345   2.4 2014-03-10 02:50:18 2.40       NA
     1000: 10-03-2014 02:50:48 998   0.7 2014-03-10 02:50:48 5.30       NA 

and gives me the warning message:

Warning message:
In pidLookupTable$pid == x 
  longer object length is not a multiple of shorter object length

The warning message and the incorrect result mean that I am doing something completely wrong.

Help!! This is driving me mental

Solution

I suggest you look at the introduction vignette for data.table (vignette("datatable-intro")), since this is something data.table is explicitly built for.

This will give you exactly what you want, and should be much, much faster:

setkey(logStats, "pid")
setkey(pidLookupTable, "pid")
logStats[pidLookupTable]
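
A note on why the original attempt failed, plus a possible variant. Inside `logStats[, pidName := pidNameLookup(pid)]` the function receives the whole pid column at once, so `pidLookupTable$pid == x` compares a 3-element vector against a much longer one and R recycles the shorter vector, which is what the warning reports. The keyed join above returns a new table, and `setkey` re-sorts logStats by pid. If the goal is to add the column to logStats in place while keeping the row order, an update join is one way to do it. The following is a minimal sketch under assumptions: the two tables are small stand-ins built from the values in the question, the `on =` argument needs a reasonably recent data.table version, and `viaMatch` is a hypothetical name used only for illustration.

```r
library(data.table)

# Small stand-ins for the tables in the question (values copied from the post).
logStats <- data.table(
  time = as.POSIXct(c("2014-03-10 00:00:00", "2014-03-10 00:00:00",
                      "2014-03-10 00:00:00", "2014-03-10 00:05:00",
                      "2014-03-10 00:05:00", "2014-03-10 00:05:00"), tz = "UTC"),
  pid  = c(998, 11, 345, 998, 11, 345),
  mean = c(3.57, 4.09, 3.38, 4.866667, 3.677778, 4.4875)
)
pidLookupTable <- data.table(
  pid     = c(998, 11, 345),
  pidName = c("Apple", "Bannana", "Cinnamon")
)

# Update join: for each row of logStats, find the matching pid in
# pidLookupTable and copy its pidName across by reference.
# `i.pidName` refers to the column of the table in the `i` position.
logStats[pidLookupTable, pidName := i.pidName, on = "pid"]

print(logStats)

# Base-R alternative that also vectorises the lookup: match() returns, for
# each logStats$pid, its position in pidLookupTable$pid (NA if absent).
viaMatch <- pidLookupTable$pidName[match(logStats$pid, pidLookupTable$pid)]
```

Both variants leave the rows of logStats in their original order. A pid that is missing from the lookup table ends up with NA as its name.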
