Referring to data.table columns by names saved in variables

Question

data.table is a fantastic R package, and I am using it in a library I am developing. So far all is going very well, except for one complication: referring to data.table columns by names saved in variables seems much harder than it is with conventional data frames, where I could write, for example: colname="col"; df[df[,colname]<5,colname]=0
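For reference, the data frame idiom mentioned above can be run as a small base R sketch. The sample data and the column name other are made up for illustration.

```r
# Base R: the column name lives in an ordinary character variable.
df <- data.frame(col = c(1, 7, 3, 9), other = c("a", "b", "c", "d"))
colname <- "col"

# Rows where the named column is below 5 get that column set to 0.
df[df[, colname] < 5, colname] <- 0

print(df$col)
```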

Perhaps what complicates things most is the apparent lack of consistent syntax for this in data.table. In some cases eval(colname), get(colname), or even c(colname) seem to work. In others, DT[,colname, with=F] is the solution. In yet others, such as the set() and subset() functions, I haven't found a solution at all. Finally, an extreme but quite common use case was discussed earlier (passing column names to data.table programmatically), and the solutions proposed there, while apparently doing their job, did not seem particularly readable.
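As a rough illustration of the selection forms mentioned above, here is a minimal sketch of picking a column whose name is held in a variable. The table DT and its values are invented for the example, and the `..` prefix assumes a reasonably recent data.table release.

```r
library(data.table)

DT <- data.table(dist = c(1, 2, 3), val = c(10, 20, 30))
colname <- "val"

# One-column data.table selected by a name held in a variable.
sel_with <- DT[, colname, with = FALSE]

# The same selection with the `..` prefix, which looks colname up
# in the calling scope rather than among the columns of DT.
sel_dots <- DT[, ..colname]

# The column itself as a plain vector.
vec <- DT[[colname]]
```

The first two forms return a data.table, while `[[` returns the underlying vector, so which one fits depends on what the surrounding code expects.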

Perhaps I am overcomplicating things? If anyone could jot down a quick cheatsheet for referring to data.table column names through variables in different common scenarios, I would be very grateful.

Update:

Some specific examples that work, provided I can hard-code the column names:

x.short = subset(x, abs(dist)<=100)
set(x, which(x$val<10), "val", 0) 

Now assume distcol="dist", valcol="val". What is the best way to do the above using distcol and valcol rather than dist and val?

Recommended answer

If you are going to be doing complicated operations inside your j expressions, you should probably use eval and quote. One problem with that in the current version of data.table is that the environment of eval is not always processed correctly; see the question "eval and quote in data.table" (note: that answer has since been updated following an update to the package). The current fix is to pass .SD to eval. As far as I can tell from a few tests I've run, this doesn't affect speed (the way that, for example, having .SD[1] in j would).

Interestingly, this issue only plagues j; you'll be fine using eval normally in i (where .SD is not available anyway).
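A minimal sketch of eval in i, assuming a small table like the one used in this answer. Here the whole filter condition is stored as a quoted expression; the name cond is made up for the example.

```r
library(data.table)

x <- data.table(dist = 1:10, val = 1:10)

# The filter is built once as an unevaluated expression...
cond <- quote(val < 5)

# ...and evaluated in i, where val resolves to the column of x.
small <- x[eval(cond)]
```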

The other problem is assignment, where you have to have strings. I know one way to extract the string name from a quoted expression. It's not pretty, but it works. Here's an example combining everything:

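library(data.table)

x = data.table(dist = c(1:10), val = c(1:10))
distcol = quote(dist)  # column names kept as unevaluated symbols
valcol = quote(val)

# i filters rows; capture.output(str(...)) turns the quoted name into a string for :=
x[eval(valcol) < 5,
  capture.output(str(distcol, give.head = FALSE)) := eval(distcol)*sum(eval(distcol, .SD))]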

Note that I got away with not passing .SD to one eval(distcol), but it would not work if I removed .SD from the other eval.
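If only the name string is needed, a shorter alternative (not the one used above) is base R's as.character(), which converts a symbol to its name. The sketch below assumes the same sample table; the variable target is invented for the example, and wrapping it in parentheses on the left of := tells data.table to use its value as the column name.

```r
library(data.table)

x <- data.table(dist = 1:10, val = 1:10)
distcol <- quote(dist)

# as.character() on a symbol yields its name as a string.
target <- as.character(distcol)

# (target) on the left-hand side of := is read as "the column named by target".
x[val < 5, (target) := dist * 2L]
```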

Another option is to use get:

diststr = "dist"
valstr = "val"

x[get(valstr) < 5, c(diststr) := get(diststr)*sum(get(diststr))]
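Applied to the two examples from the question's update, the get approach might look like the following sketch. The sample values are made up, and this is one possible way to write it rather than the only one.

```r
library(data.table)

x <- data.table(dist = c(-150, 50, 200), val = c(5, 20, 8))
distcol <- "dist"
valcol <- "val"

# Counterpart of subset(x, abs(dist) <= 100), with the column looked up by name.
x.short <- x[abs(get(distcol)) <= 100]

# Counterpart of set(x, which(x$val < 10), "val", 0):
# set() already takes the column name as a string in its j argument.
set(x, which(x[[valcol]] < 10), valcol, 0)
```

Since set() accepts a character column name directly, only the row selection needs x[[valcol]] in place of x$val.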
