Let dplyr mutate use a formula

Problem description

I have a large dataset stored in a long data frame. I would like to extract the data for some variables and use a formula to generate new data. All the necessary information should be taken from the formula itself. First, I want to use the formula to filter the dataset down to the variables it mentions; I use the all.vars() function for that. I also rely on the formula.tools package, which is on CRAN, to extract the left-hand and right-hand sides of the equation (lhs and rhs, respectively).
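To make the pieces of a formula concrete, here is a minimal base R sketch that is not part of the original post. It uses only all.vars() and the fact that a two-sided formula is a call with three elements. It assumes that lhs() and rhs() from formula.tools return the same two pieces as equation[[2]] and equation[[3]].

```r
# A two-sided formula is a call to `~` with three elements:
# the operator, the left-hand side, and the right-hand side.
equation <- GDPpC ~ GDP / Population

# Every variable name used anywhere in the formula, including the target.
vars_all <- all.vars(equation)

# Left-hand side (a symbol) and right-hand side (an unevaluated call).
lhs_part <- equation[[2]]
rhs_part <- equation[[3]]

# Name for the new column, and the inputs the expression needs.
new_name <- as.character(lhs_part)
inputs   <- all.vars(rhs_part)
```

Note that all.vars(equation) also returns the name on the left-hand side. Filtering with it is harmless as long as that name does not already occur in the variable column.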

library(dplyr)
library(reshape2)
library(formula.tools)

set.seed(100)

the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year    = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value = rnorm(27, 100, 100))

add_variable <- function(df, equation){
  df <- filter(df, variable %in% all.vars(equation))

  df <- dcast(df, country + year ~ variable)

  df <- mutate_(df, rhs(equation))

  # code to keep only the newly generated column
  # ...

  df <- melt(df, id.vars = c("country", "year"))
}

result <- add_variable(the_data, GDPpC ~ GDP / Population)

The newly generated column should be named GDPpC, but it is currently called GDP/Population. How can this be fixed? As a final step, I would also like to filter the data so that the result contains only the newly generated variable, which can then be appended to the source data frame via rbind.

Recommended answer

Would this be a solution?

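add_variable <- function(df, equation){
  # keep only the rows for variables that appear in the formula
  df <- filter(df, variable %in% all.vars(equation))
  # remember the input variables so they can be dropped at the end
  orig_vars <- unique(df$variable)
  # reshape to wide format: one column per variable
  df <- dcast(df, country + year ~ variable)

  # evaluate the right-hand side; the new column is appended last
  df <- mutate_(df, rhs(equation))
  # rename that last column to the left-hand side of the formula
  colnames(df)[ncol(df)] <- as.character(lhs(equation))

  # back to long format, keeping only the newly generated variable
  df <- melt(df, id.vars = c("country", "year"))
  df <- filter(df, !variable %in% orig_vars)
}

result <- add_variable(the_data, GDPpC ~ GDP / Population)
result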
  country year variable      value
1     CHN 2000    GDPpC 0.04885649
2     CHN 2010    GDPpC 2.62313658
3     CHN 2020    GDPpC 0.31685382
4     DEU 2000    GDPpC 0.80180998
5     DEU 2010    GDPpC 0.62642877
6     DEU 2020    GDPpC 0.97587188
7     USA 2000    GDPpC 0.26383912
8     USA 2010    GDPpC 1.01303516
9     USA 2020    GDPpC 0.69851501
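mutate_() has since been deprecated in dplyr, so the same idea can also be sketched without it. The version below is only an illustration under stated assumptions: it uses base R alone, the function name add_variable_base and the toy data are hypothetical, and it assumes the same long layout as above (columns country, year, variable, value) with one row per country, year and variable.

```r
# Hypothetical base R variant: evaluate the right-hand side of the formula
# against a wide table and return only the new variable in long format.
add_variable_base <- function(df, equation) {
  new_name <- as.character(equation[[2]])  # left-hand side as column name
  rhs_expr <- equation[[3]]                # right-hand side, unevaluated
  inputs   <- all.vars(rhs_expr)           # variables the expression needs

  # One small table per input variable, with the value column renamed.
  pieces <- lapply(inputs, function(v) {
    piece <- df[df$variable == v, c("country", "year", "value")]
    names(piece)[3] <- v
    piece
  })

  # Join the pieces on country and year to get one column per variable.
  wide <- Reduce(function(a, b) merge(a, b, by = c("country", "year")), pieces)

  # Evaluate the expression with the wide columns in scope.
  data.frame(
    country  = wide$country,
    year     = wide$year,
    variable = new_name,
    value    = eval(rhs_expr, wide)
  )
}

# Toy data (made up for this sketch, not taken from the question).
toy <- data.frame(
  country  = rep(c("A", "B"), each = 4),
  year     = rep(c(2000, 2010), times = 4),
  variable = rep(rep(c("GDP", "Population"), each = 2), times = 2),
  value    = c(10, 20, 2, 4, 30, 40, 3, 8)
)

toy_result <- add_variable_base(toy, GDPpC ~ GDP / Population)
```

With the data from the question, the call would be add_variable_base(the_data, GDPpC ~ GDP / Population), and the returned rows could then be appended to the source data frame with rbind.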
