Referring to data.table columns by names saved in variables


Problem description


data.table is a fantastic R package and I am using it in a library I am developing. So far all is going very well, except for one complication. Compared to conventional data frames, it seems much more difficult to refer to data.table columns using names saved in variables (with a data frame this would be, for example: colname="col"; df[df[,colname]<5,colname]=0).

Perhaps what complicates the things most is the apparent lack of consistency of syntax on this in data.table. In some cases, eval(colname) and get(colname), or even c(colname) seem to work. In others, DT[,colname, with=F] is the solution. Yet in others, such as, for example, the set() and subset() functions, I haven't found a solution at all. Finally, an extreme, albeit also quite common use case was discussed earlier (passing column names to data.table programmatically) and the proposed solutions, albeit apparently doing their job, did not seem particularly readable...

Perhaps I am complicating things too much? If anyone could jot down a quick cheatsheet for referring to data.table column names using variables for different common scenarios, I would be very grateful.

UPDATE:

Some specific examples that work provided I can hard code column names:

x.short = subset(x, abs(dist)<=100)
set(x, which(x$val<10), "val", 0) 

Now assume distcol="dist", valcol="val". What is the best way to do the above using distcol and valcol, but not dist and val?

Solution

If you are going to be doing complicated operations inside your j expressions, you should probably use eval and quote. One problem with that in the current version of data.table is that the environment of eval is not always processed correctly; see "eval and quote in data.table" (http://stackoverflow.com/questions/15913832/eval-and-quote-in-data-table; note that the answer there has been updated following an update to the package). The current fix is to pass .SD as the second argument to eval. As far as I can tell from a few tests I've run, this doesn't affect speed (the way that, e.g., having .SD[1] in j would).

Interestingly, this issue only affects j; you'll be fine using eval normally in i (where .SD is not available anyway).

The other problem is assignment, and there you have to have strings. I know one way to extract the string name from a quoted expression - it's not pretty, but it works. Here's an example combining everything together:

x = data.table(dist = c(1:10), val = c(1:10))
distcol = quote(dist)
valcol = quote(val)

x[eval(valcol) < 5,
  capture.output(str(distcol, give.head = F)) := eval(distcol)*sum(eval(distcol, .SD))]

Note that I got away with not adding .SD in one eval(distcol), but I wouldn't if I took it out of the other eval as well.
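As a possibly tidier alternative to capture.output(str(...)) for recovering the column name, base R's as.character() returns the name of a quoted symbol as a string. The sketch below assumes the same x, distcol and valcol as above; distname is a hypothetical variable introduced only for illustration, and the factor 2L is arbitrary.

```r
library(data.table)

x <- data.table(dist = 1:10, val = 1:10)
distcol <- quote(dist)
valcol  <- quote(val)

# as.character() on a symbol gives its name as a string ("dist")
distname <- as.character(distcol)

# Parentheses on the left of := make data.table use the value of
# distname as the column name rather than the literal name "distname".
# .SD is passed to eval() in j, following the fix described above.
x[eval(valcol) < 5, (distname) := eval(distcol, .SD) * 2L]
```

Only the rows where val < 5 are updated; the remaining rows of dist are left untouched.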

Another option (which I dislike, as I think it's much more error-prone), is to use get:

diststr = "dist"
valstr = "val"

x[get(valstr) < 5, c(diststr) := get(diststr)*sum(get(diststr))]
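For the two concrete examples in the question, a minimal sketch along the same lines might look as follows. It assumes distcol and valcol hold strings, as in the question's update, and the sample values in x are made up for illustration. set() takes the column name as a string in j, and [[ ]] extracts a column by its string name.

```r
library(data.table)

# Made-up sample data, for illustration only
x <- data.table(dist = c(-150, 50, 200), val = c(5, 20, 8))
distcol <- "dist"
valcol  <- "val"

# Equivalent of subset(x, abs(dist) <= 100): get() looks the column up by name in i
x.short <- x[abs(get(distcol)) <= 100]

# Equivalent of set(x, which(x$val < 10), "val", 0)
set(x, i = which(x[[valcol]] < 10), j = valcol, value = 0)

# Selecting a column by a name held in a variable, as mentioned in the question
sel <- x[, valcol, with = FALSE]
```

Note that set() modifies x by reference, while x.short is a separate table created before the update.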
