Create new variables based on another df


Question

I'm trying to up my R game, and I clearly need some guidance. I want to create a lot of variables (93, to be exact), and I want to do it the smart way, but I'm stuck.

My problem: I have a dataframe (df) containing some variables, including the "main" one, Description, which holds the word stems of my description text. Another dataframe (reference) is more of a lookup table with two columns: the category and the regex needed to identify it. I kept only 3 entries here, but it originally has 93.

Code:

library(tidyverse)

df <- tibble("FlawType" = c(rep("Medium", 5), rep("Major", 5)),
         "Description" = c("utilizaca indev equip final divers daquel justific aquisica",
                           "utilizaca modal indev licitac aquisica mater previst plan trabalh conveni nomd",
                           "aquisica indev lanch gener alimentici secret municip educaca mont r",
                           "uso indev recurs bloc atenca basic aquisica medic realizaca trat intim prefeit decisa judic",
                           "indici irregular favorec process licitato no aquisica medic farmac basic raza concentraca indevid empr certam",
                           "localizaca bem vist realiz equip fiscalizaca cgu escol municip abril municipi palestin par",
                           "telecentr inat ausenc equip local instalaca equip defeit",
                           "equip local",
                           "equip mater permanent adquir implantaca banc aliment send utiliz outr local simples encontr in loc realiz equip",
                           "mater equip gener alimentici adquir recurs cra por entreg local atend"))

reference <- tibble(var = c("Aquisição indevida", "Equipamentos não localizados", "Despesa irregular"),
                    regex = c("(aquisica.*indev|indev.*aquisica)", "(equip.*local|local.*equip)", "(desp.*irregul|irregul.*desp)"))

I can more or less create the three new variables in my sample df, but the result turns out to be a list, and I have to extract the columns from it. I thought that wouldn't be a problem, but when I try to run it on my original df (60k+ rows), it gets stuck...

The idea is: use reference$var as the name of each new variable, and use the associated regex (reference$regex) to create a dummy (0/1) variable for every entry in the reference.

Code that works on the sample but not on the original df, just for reference:

varnames <- unique(reference$var)

for(varname in varnames){

  fd[[varname]] <- df %>% 
    mutate(!!paste0(varname) := ifelse(str_detect(df$Description, reference$regex), 1, 0))

}

df <- bind_cols(df, map_df(fd,3))

Thanks in advance.

Recommended answer

There's probably a more elegant way to do this (I'm not a huge fan of having to use bind_cols at the end to bring back the original variables), but this should work:

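# Build a one-column tibble holding the 0/1 dummy for a single category.
# x: name of the new column (from reference$var); y: regex (from reference$regex).
add_vars <- function(df, x, y) {
  x <- quo_name(x)
  transmute(df, !! x := ifelse(str_detect(Description, y), 1, 0))
}

# Pair each name with its regex, column-bind the dummies, then reattach them to df.
bind_cols(df, map2_dfc(reference$var, reference$regex, ~ add_vars(df, .x, .y)))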

# A tibble: 10 x 5
   FlawType Description                                                 `Aquisição indevi~ `Equipamentos não loc~ `Despesa irregul~
   <chr>    <chr>                                                                    <dbl>                  <dbl>             <dbl>
 1 Medium   utilizaca indev equip final divers daquel justific aquisica                  1                      0                 0
 2 Medium   utilizaca modal indev licitac aquisica mater previst plan ~                  1                      0                 0
 3 Medium   aquisica indev lanch gener alimentici secret municip educa~                  1                      0                 0
 4 Medium   uso indev recurs bloc atenca basic aquisica medic realizac~                  1                      0                 0
 5 Medium   indici irregular favorec process licitato no aquisica medi~                  1                      0                 0
 6 Major    localizaca bem vist realiz equip fiscalizaca cgu escol mun~                  0                      1                 0
 7 Major    telecentr inat ausenc equip local instalaca equip defeit                     0                      1                 0
 8 Major    equip local                                                                  0                      1                 0
 9 Major    equip mater permanent adquir implantaca banc aliment send ~                  0                      1                 0
10 Major    mater equip gener alimentici adquir recurs cra por entreg ~                  0                      1                 0
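
For comparison, here is a minimal sketch of one way to get the same kind of result without bind_cols. It is an alternative technique in base R (grepl and [[<-), not the method used in the answer, and it keeps only three of the sample descriptions so the block stays short. Each regex is tested against the whole Description column on its own, so every pattern stays paired with its own category name. The dummies come out as integers rather than doubles.

```r
# Subset of the sample data from the question, rebuilt as base R data frames
# (an assumption made for brevity; no packages are needed).
df <- data.frame(
  FlawType = c("Medium", "Major", "Major"),
  Description = c(
    "utilizaca indev equip final divers daquel justific aquisica",
    "telecentr inat ausenc equip local instalaca equip defeit",
    "equip local"
  ),
  stringsAsFactors = FALSE
)

reference <- data.frame(
  var = c("Aquisição indevida", "Equipamentos não localizados", "Despesa irregular"),
  regex = c("(aquisica.*indev|indev.*aquisica)",
            "(equip.*local|local.*equip)",
            "(desp.*irregul|irregul.*desp)"),
  stringsAsFactors = FALSE
)

# One iteration per reference row: test a single regex against every
# description and store the 0/1 result under that row's category name.
for (i in seq_len(nrow(reference))) {
  df[[reference$var[i]]] <- as.integer(grepl(reference$regex[i], df$Description))
}

print(df)
```

With this layout the loop runs once per reference entry (93 times for the full table) rather than once per row, and each iteration is a single vectorised call over the Description column.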
