Let dplyr mutate use formula
Question
I have a large dataset stored in a long data frame. I would like to extract data on some variables and use a formula to generate new data. All the necessary information should be extracted from the formula. First, I want to use the information in the formula to filter the dataset for the corresponding variables; I use the all.vars() function for that. I also rely on the formula.tools package, which is on CRAN. It is used for easy extraction of the left- and right-hand sides of the equation (lhs and rhs, respectively).
library(dplyr)
library(reshape2)
library(formula.tools)
set.seed(100)
the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value = rnorm(27, 100, 100))
add_variable <- function(df, equation){
  df <- filter(df, variable %in% all.vars(equation))
  df <- dcast(df, country + year ~ variable)
  df <- mutate_(df, rhs(equation))
  # code to keep only the newly generated column
  # ...
  df <- melt(df, id.vars = c("country", "year"))
}
result <- add_variable(the_data, GDPpC ~ GDP / Population)
The newly generated column should be named GDPpC, but currently it is called GDP/Population. How can this be improved? In a final step I would also like to filter the data so that the result contains only the newly generated data, which can then be appended to the source data frame via rbind.
Recommended answer
Would this be a solution?
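add_variable <- function(df, equation){
  # keep only the variables that appear in the formula
  df <- filter(df, variable %in% all.vars(equation))
  orig_vars <- unique(df$variable)
  # one column per variable, one row per country and year
  df <- dcast(df, country + year ~ variable)
  # evaluate the right-hand side; note that mutate_() is deprecated in current dplyr
  df <- mutate_(df, rhs(equation))
  # rename the new (last) column to the left-hand side of the formula
  colnames(df)[ncol(df)] <- as.character(lhs(equation))
  df <- melt(df, id.vars = c("country", "year"))
  # drop the input variables so only the new one remains
  df <- filter(df, !variable %in% orig_vars)
  df
}
result <- add_variable(the_data, GDPpC ~ GDP / Population)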
result
  country year variable      value
1     CHN 2000    GDPpC 0.04885649
2     CHN 2010    GDPpC 2.62313658
3     CHN 2020    GDPpC 0.31685382
4     DEU 2000    GDPpC 0.80180998
5     DEU 2010    GDPpC 0.62642877
6     DEU 2020    GDPpC 0.97587188
7     USA 2000    GDPpC 0.26383912
8     USA 2010    GDPpC 1.01303516
9     USA 2020    GDPpC 0.69851501
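Since mutate_() is deprecated in recent dplyr releases, the same idea could also be sketched with tidy evaluation. This is only one possible variant, not part of the original answer: the function name add_variable2 is made up for illustration, the formula is taken apart with base R indexing (equation[[2]] and equation[[3]]) instead of formula.tools, and it assumes a two-sided formula whose right-hand side refers only to values of the variable column.

```r
library(dplyr)
library(reshape2)

# Same example data as in the question
set.seed(100)
the_data <- data.frame(
  country  = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
  year     = c(2000, 2010, 2020),
  variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
  value    = rnorm(27, 100, 100)
)

# add_variable2 is a hypothetical name used only for this sketch
add_variable2 <- function(df, equation) {
  new_name <- as.character(equation[[2]])  # left-hand side as a string, e.g. "GDPpC"
  rhs_expr <- equation[[3]]                # right-hand side as an unevaluated call

  # keep only the variables used on the right-hand side
  df <- filter(df, variable %in% all.vars(rhs_expr))

  # long to wide: one column per variable
  wide <- dcast(df, country + year ~ variable, value.var = "value")

  # evaluate the right-hand side inside the wide data and return long format;
  # !! inlines the name and the expression, so no renaming step is needed
  transmute(wide, country, year, variable = !!new_name, value = !!rhs_expr)
}

result <- add_variable2(the_data, GDPpC ~ GDP / Population)

# append the new rows to the source data frame
combined <- rbind(the_data, result)
```

Because the result already has the columns country, year, variable and value, it contains only the newly generated rows and can be passed to rbind directly.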