Update/Replace Values in Dataframe with Tidyverse Join


Problem description


What is the most efficient way to update/replace NAs in the main dataset with the (correct) values from a lookup table? This is such a common operation! Similar questions do not seem to have tidy solutions.

Constraints: 1) Please assume a large number of missing values and a bigger lookup table than in the example given, so case-wise replacement operations would be impractical (no case_when, if_else, etc.).

2) The lookup table does not have all values of the main dataframe, only the replacement ones.

A tidyverse solution is much preferred.

library(tidyverse)

### Main Dataframe ###
df1 <- tibble(
  state_abbrev = state.abb[1:10],
  state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
  value = sample(500:1200, 10, replace=TRUE)
)


#> # A tibble: 10 x 3
#>    state_abbrev state_name value
#>    <chr>        <chr>      <int>
#>  1 AL           Alabama      551
#>  2 AK           Alaska       765
#>  3 AZ           Arizona      508
#>  4 AR           Arkansas     756
#>  5 CA           California   741
#>  6 CO           <NA>        1100
#>  7 CT           <NA>         719
#>  8 DE           <NA>         874
#>  9 FL           Florida      749
#> 10 GA           Georgia      580


### Lookup Dataframe ###
lookup_df <- tibble(
  state_abbrev = state.abb[6:8],
  state_name = state.name[6:8]
)

#> # A tibble: 3 x 2
#>   state_abbrev state_name 
#>   <chr>        <chr>      
#> 1 CO           Colorado   
#> 2 CT           Connecticut
#> 3 DE           Delaware

Ideally, a left_join would have a replacement option for missing values. Alas...

left_join(df1, lookup_df)
#> Joining, by = c("state_abbrev", "state_name")
#> # A tibble: 10 x 3
#>    state_abbrev state_name value
#>    <chr>        <chr>      <int>
#>  1 AL           Alabama      551
#>  2 AK           Alaska       765
#>  3 AZ           Arizona      508
#>  4 AR           Arkansas     756
#>  5 CA           California   741
#>  6 CO           <NA>        1100
#>  7 CT           <NA>         719
#>  8 DE           <NA>         874
#>  9 FL           Florida      749
#> 10 GA           Georgia      580


Created on 2018-07-28 by the reprex package (v0.2.0).

Solution

Picking up Alistaire's and Nettle's suggestions and transforming them into a working solution:

df1 %>% 
  left_join(lookup_df, by = "state_abbrev") %>% 
  mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
  select(-state_name.x, -state_name.y)

# A tibble: 10 x 3
   state_abbrev value state_name 
   <chr>        <int> <chr>      
 1 AL             671 Alabama    
 2 AK             501 Alaska     
 3 AZ            1030 Arizona    
 4 AR             694 Arkansas   
 5 CA             881 California 
 6 CO             821 Colorado   
 7 CT             742 Connecticut
 8 DE             665 Delaware   
 9 FL             948 Florida    
10 GA             790 Georgia
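
For readers on dplyr 1.0.0 or later, the same fill can probably be expressed with `rows_patch()`, which replaces only the `NA` cells of the first table with values from matching rows of the second. This is not part of the original answer. The small tables below are made-up stand-ins for `df1` and `lookup_df`, and the final line simply checks that both approaches produce the same names.

```r
library(dplyr)

# Made-up stand-ins for df1 and lookup_df from the question
main <- tibble(
  state_abbrev = c("CA", "CO", "CT", "FL"),
  state_name   = c("California", NA, NA, "Florida"),
  value        = c(741L, 1100L, 719L, 749L)
)
lookup <- tibble(
  state_abbrev = c("CO", "CT"),
  state_name   = c("Colorado", "Connecticut")
)

# Approach from the answer: join on the key only, which yields
# state_name.x (from main) and state_name.y (from lookup),
# then keep the first non-missing of the two
via_join <- main %>%
  left_join(lookup, by = "state_abbrev") %>%
  mutate(state_name = coalesce(state_name.x, state_name.y)) %>%
  select(-state_name.x, -state_name.y)

# rows_patch(): only NA cells in `main` are replaced and the column order is kept.
# By default every key in `lookup` must exist in `main`, and keys in `lookup` must be unique.
via_patch <- rows_patch(main, lookup, by = "state_abbrev")

# Both approaches should agree on the filled column
identical(via_join$state_name, via_patch$state_name)
```

One practical difference: the join-and-coalesce version moves `state_name` to the last column, while `rows_patch()` leaves the columns in their original order.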


The OP has stated a preference for a "tidyverse" solution. However, update joins are already available with the data.table package:

library(data.table)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1

    state_abbrev  state_name value
 1:           AL     Alabama  1103
 2:           AK      Alaska  1036
 3:           AZ     Arizona   811
 4:           AR    Arkansas   604
 5:           CA  California   868
 6:           CO    Colorado  1129
 7:           CT Connecticut   819
 8:           DE    Delaware  1194
 9:           FL     Florida   888
10:           GA     Georgia   501
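
Note that the update join as written assigns `i.state_name` to every matched row, whether or not the existing value is `NA`. That is harmless here because the lookup table only covers the missing rows. If existing values had to be preserved, one possible variation (not part of the original answer) wraps the assignment in `fcoalesce()`, which is available in data.table 1.12.4 and later. The data below is made up for illustration.

```r
library(data.table)

# Made-up data: "CO" already has a (deliberately different) name, "CT" is missing
main <- data.table(
  state_abbrev = c("CA", "CO", "CT"),
  state_name   = c("California", "Centennial State", NA)
)
lookup <- data.table(
  state_abbrev = c("CO", "CT"),
  state_name   = c("Colorado", "Connecticut")
)

# Plain update join: every matched row is overwritten with the lookup value
overwrite <- copy(main)
overwrite[lookup, on = "state_abbrev", state_name := i.state_name]

# fcoalesce() keeps the existing value and falls back to the lookup value only for NA
# (inside j, the unprefixed state_name refers to the column of the table being updated)
fill_only <- copy(main)
fill_only[lookup, on = "state_abbrev", state_name := fcoalesce(state_name, i.state_name)]

overwrite$state_name
fill_only$state_name
```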

Benchmark

library(bench)
bm <- press(
  na_share = c(0.1, 0.5, 0.9),
  n_row = length(state.abb) * 2 * c(1, 100, 10000),
  {
    n_na <- na_share * length(state.abb)
    set.seed(1)
    na_idx <- sample(length(state.abb), n_na)
    tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
    lookup_df <-tmp[na_idx] 
    tmp[na_idx, state_name := NA]
    df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
    mark(
      dplyr = {
        df1 <- copy(df0)
        df1 <- df1 %>% 
          left_join(lookup_df, by = "state_abbrev") %>% 
          mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
          select(-state_name.x, -state_name.y)
        df1
      },
      upd_join = {
        df1 <- copy(df0)
        setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
        df1
      }
    )
  }
)
ggplot2::autoplot(bm)

data.table's update join is always faster in this benchmark (note the log time scale).

As the update join modifies the data object by reference, a fresh copy is used for each benchmark run.
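
Why the fresh copy matters can be seen in a small sketch. `:=` modifies a data.table by reference, so a plain assignment to a second name does not protect the original, whereas `copy()` does. The objects below are made up for illustration and are not taken from the benchmark.

```r
library(data.table)

lookup <- data.table(id = 2L, label = "b")

# Without copy(): `alias` and `original` refer to the same object,
# so the update join is visible through both names
original <- data.table(id = 1:2, label = c("a", NA))
alias <- original
alias[lookup, on = "id", label := i.label]
anyNA(original$label)   # the NA in `original` has been filled as well

# With copy(): the update is confined to the duplicate
protected <- data.table(id = 1:2, label = c("a", NA))
duplicate <- copy(protected)
duplicate[lookup, on = "id", label := i.label]
anyNA(protected$label)  # `protected` still contains its NA
```

Without the copy, every benchmark iteration after the first would run the update join on data that had already been filled.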
