Data.table: Apply function over groups with reference to set value in each group. Pass resulting columns into a function

Question

I have data in long format that will be grouped by geography. Within each group, I want to calculate the difference between one variable of interest and each of the other variables of interest. I could not figure out how to do this efficiently in a single data.table statement, so I used a workaround, which introduced some new errors along the way (I fixed those with more workarounds, but help here would also be appreciated!).

I then want to pass the resulting columns into a ggplot function, but I cannot get the recommended methods to work, so I am using a deprecated method.

library(data.table)
library(ggplot2)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16) )

> results[c(1:4,13:16)]
   geography variable   statistic
1:         1    alpha -0.62645381
2:         1    bravo  0.18364332
3:         1  charlie -0.83562861
4:         1    delta  1.59528080
5:         4    alpha -0.62124058
6:         4    bravo -2.21469989
7:         4  charlie  1.12493092
8:         4    delta -0.04493361

base_variable <- "alpha"

From this point I ideally want to write a simple piece of code that groups by the geographies, then returns this table in the same format but with the statistic for each variable being (base_variable - variable) in each group.

I could not figure out how to do this, so my workaround is below. Any advice on a better method is appreciated.

# Convert to a wide table so we can do the subtraction by rows
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

   geography      alpha      bravo    charlie       delta
1:         1 -0.6264538  0.1836433 -0.8356286  1.59528080
2:         2  0.3295078 -0.8204684  0.4874291  0.73832471
3:         3  0.5757814 -0.3053884  1.5117812  0.38984324
4:         4 -0.6212406 -2.2146999  1.1249309 -0.04493361

this_is_a_hack <- as.data.table(lapply(results_wide[,-1], function(x) results_wide[, ..base_variable] - x))

   alpha.alpha bravo.alpha charlie.alpha delta.alpha
1:           0  -0.8100971     0.2091748  -2.2217346
2:           0   1.1499762    -0.1579213  -0.4088169
3:           0   0.8811697    -0.9359998   0.1859381
4:           0   1.5934593    -1.7461715  -0.5763070

Names are now messed up and we don't have a geography. Why are the names like this? Also, need to re-add geography.

this_is_a_hack[, geography := results_wide[, geography] ]

normalise_these_names <- colnames(this_is_a_hack)
#Regex approach. Hacky and situational. 
new_names <- sub("\\.(.*)", "", normalise_these_names[normalise_these_names != "geography"] )
normalise_these_names[normalise_these_names != "geography"] <- new_names
#Makes use of the fact that geographies will appear last in the data.table, not generalisable approach.
colnames(this_is_a_hack) <- normalise_these_names 
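
On the naming question: the odd names most likely arise because `results_wide[, ..base_variable]` returns a one-column data.table rather than a plain vector, so every element produced by `lapply()` is itself a data.table with a column called `alpha`, and `as.data.table()` joins the outer and inner names. A minimal sketch of one way to avoid both the renaming and the re-adding of geography is to extract the base column as a vector with `[[` and update the value columns in place. The names `base_values`, `value_cols` and `diffs_wide` are introduced here for illustration only.

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

# [[ ]] returns the base column as a plain numeric vector, not a data.table
base_values <- results_wide[[base_variable]]

# Every column except the grouping column gets the subtraction
value_cols <- setdiff(names(results_wide), "geography")

# Work on a copy so results_wide itself is left unchanged
diffs_wide <- copy(results_wide)
diffs_wide[, (value_cols) := lapply(.SD, function(x) base_values - x),
           .SDcols = value_cols]

names(diffs_wide)
# "geography" "alpha" "bravo" "charlie" "delta"
```

Because the columns are overwritten by name, geography stays in place and no regex clean-up is needed.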

I don't need the base variable anymore, as all its values are zero, so I try to drop it. However, I can't seem to do this the way I usually do:

this_is_a_hack[, ..base_variable := NULL] 
Warning message:
In `[.data.table`(this_is_a_hack, , `:=`(..base_variable, NULL)) :
  Column '..base_variable' does not exist to remove

library(dplyr)
this_is_a_hack <- select(this_is_a_hack, -base_variable)
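
For the column drop, the `..` prefix is meant for selecting columns in `j`, not for the left-hand side of `:=`, which is why data.table looks for a column literally called `..base_variable`. A small sketch of the data.table way, wrapping the variable in parentheses so its value is used as the column name. The table `dt` and its values are made up for illustration.

```r
library(data.table)

# Toy table standing in for this_is_a_hack (values are illustrative only)
dt <- data.table(geography = 1:4,
                 alpha = 0,
                 bravo = c(-0.8, 1.1, 0.9, 1.6))
base_variable <- "alpha"

# Parentheses on the left of := make data.table evaluate base_variable
# and treat its value ("alpha") as the column to remove, by reference
dt[, (base_variable) := NULL]

names(dt)
# "geography" "bravo"
```

This removes the column in place, so no reassignment and no dplyr dependency is needed.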

final_result <- melt(this_is_a_hack, id.vars = "geography")

> final_result[c(1:4,9:12)]
   geography variable      value
1:         1    bravo -0.8100971
2:         2    bravo  1.1499762
3:         3    bravo  0.8811697
4:         4    bravo  1.5934593
5:         1    delta -2.2217346
6:         2    delta -0.4088169
7:         3    delta  0.1859381
8:         4    delta -0.5763070

Data is now ready to be visualised. I'm trying to pass these variables into a plotting function, but referencing data.table columns seems to be difficult compared to data frames. Apparently you should use quosures to pass data.table variables into functions, but this just errored out, so I'm using the deprecated 'aes_string' function instead. Help on this is also appreciated.

plott <- function(dataset, varx, vary, fillby) {
  # varx <- ensym(varx)
  # vary <- ensym(vary)
  # vary <- ensym(fillby)
  ggplot(dataset, 
         aes_string(x = varx, y = vary, color = fillby)) + 
    geom_point()
}

plott(dataset = final_result,
      varx = "geography",
      vary = "value",
      fillby = "variable")

# Error I get when I try the ensym(...) method in the function:
Don't know how to automatically pick scale for object of type name. Defaulting to continuous. (this message happens 3 times)
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = varx, y = vary, colour = fillby. 
Did you mistype the name of a data column or forget to add stat()?
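
The error above is probably what happens when the symbols captured by `ensym()` are passed to `aes()` without being unquoted with `!!`. Since the function is called with column names as strings anyway, one alternative that avoids `aes_string()` is the `.data` pronoun, which ggplot2 supports inside `aes()`. The sketch below assumes string inputs; the function name `plot_points` and the `toy` table are made up for illustration.

```r
library(data.table)
library(ggplot2)

plot_points <- function(dataset, varx, vary, fillby) {
  # .data[[name]] looks up a column of the plot data by its string name
  ggplot(dataset,
         aes(x = .data[[varx]], y = .data[[vary]], colour = .data[[fillby]])) +
    geom_point() +
    # Set labels explicitly so the axes show the column names
    labs(x = varx, y = vary, colour = fillby)
}

# Toy data in the same shape as final_result (values are illustrative only)
toy <- data.table(geography = rep(1:2, each = 2),
                  variable = rep(c("bravo", "charlie"), 2),
                  value = c(0.5, -0.2, 1.1, 0.3))

p <- plot_points(toy, varx = "geography", vary = "value", fillby = "variable")
```

This works the same for a data.table or a data.frame, because ggplot2 only sees the column names.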


Recommended answer

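One option is to group by 'geography' and, within each group, pick out the base value of 'statistic' with a logical condition on 'variable' (variable == base_variable). That value is subtracted from every row's 'statistic', and the rows for the base variable itself are then filtered out.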

results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
       by = geography][variable != base_variable]
#     geography variable       diff
#  1:         1    bravo  0.8100971
#  2:         1  charlie -0.2091748
#  3:         1    delta  2.2217346
#  4:         2    bravo -1.1499762
#  5:         2  charlie  0.1579213
#  6:         2    delta  0.4088169
#  7:         3    bravo -0.8811697
#  8:         3  charlie  0.9359998
#  9:         3    delta -0.1859381
# 10:         4    bravo -1.5934593
# 11:         4  charlie  1.7461715
# 12:         4    delta  0.5763070
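
Note that the expression above computes (variable - base_variable), so its signs are the reverse of the (base_variable - variable) values shown in the question's output. A sketch of the same grouped approach with the operands swapped, naming the result column `value` to match the question's `final_result` (that naming is a choice made here, not part of the answer):

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"

# Within each geography: base value minus each row's statistic,
# then drop the rows belonging to the base variable itself
final_result <- results[, .(variable,
                            value = statistic[variable == base_variable] - statistic),
                        by = geography][variable != base_variable]

final_result[geography == 1]
```

The rows come out ordered by geography rather than by variable as in the melted table, which should not matter for plotting.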
