根据单独的参考表过滤数据框 [英] Filter dataframe based on a separate reference table

查看:37
本文介绍了根据单独的参考表过滤数据框的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些数据需要使用参考表以编程方式进行清理.在参考表中,每一行都属于数据中的不同列,并指定用于过滤每个数据变量的值.

I have data that I need to clean up programmatically using a reference table. In the reference table, each row pertains to a different column in the data, and specifies the values by which to filter each data variable.

数据

library(tidyverse)

my_mtcars <-
  mtcars %>% 
  rownames_to_column("cars")

参考表

filter_ref_table <-
  structure(
  list(
    var_name = c(
      "disp",
      "wt",             
      "gear",          
      "carb",
      "mpg",
      "cars",           
      "drat"
    ),
    filtering_values = list(
      NULL,
      structure(
        list(
          min = 3.4,
          max = 3.9,
          values = list(NULL)
        ),
        class = c("tbl_df",
                  "tbl", "data.frame"),
        row.names = c(NA,-1L)
      ),
      structure(
        list(
          min = NA_integer_,
          max = NA_integer_,
          values = list(c(3))
        ),
        class = c("tbl_df",
                  "tbl", "data.frame"),
        row.names = c(NA,-1L)
      ),
      NULL,
      NULL,
      structure(
        list(
          min = NA_integer_,
          max = NA_integer_,
          values = list(c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE", 
                          "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive", 
                          "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710", 
                          "Merc 240D", "Duster 360"))
        ),
        class = c("tbl_df",
                  "tbl", "data.frame"),
        row.names = c(NA,-1L)
      ),
      NULL
    )
  ),
  row.names = c(NA,-7L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

filter_ref_table

## # A tibble: 7 x 2
##   var_name filtering_values
##   <chr>    <list>          
## 1 disp     <NULL>          
## 2 wt       <tibble [1 x 3]>
## 3 gear     <tibble [1 x 3]>
## 4 carb     <NULL>          
## 5 mpg      <NULL>          
## 6 cars     <tibble [1 x 3]>
## 7 drat     <NULL>    

仔细查看 filter_ref_table 时,我们可以取消嵌套列表列 filtering_values 的嵌套,并查看其内部构造:3列的嵌套小标题:最小值 max .

When taking a closer look at filter_ref_table, we can unnest the list-column filtering_values and see how it's constructed inside: a nested tibble with 3 columns: min, max, and values.

filter_ref_table %>% 
  filter(var_name == "wt") %>% 
  unnest(filtering_values)

## # A tibble: 1 x 4
##   var_name   min   max values
##   <chr>    <dbl> <dbl> <list>
## 1 wt         3.4   3.9 <NULL> ## when there are min/max values we know we should filter by this range

##############################################################################

filter_ref_table %>% 
  filter(var_name == "cars") %>% 
  unnest(filtering_values)        
                                  

## # A tibble: 1 x 4
##   var_name   min   max values    
##   <chr>    <int> <int> <list>    
## 1 cars        NA    NA <chr [15]>   ## when there are values inside "value" we know that we should 
#                              ↑         ## filter to keep any data rows that have either of these values
#                              ↑ 
#   [1] "Maserati Bora"       "Chrysler Imperial"   "Toyota Corona"      
#   [4] "Merc 450SE"          "Lincoln Continental" "Mazda RX4"          
#   [7] "Valiant"             "Hornet 4 Drive"      "Fiat X1-9"          
#   [10] "Camaro Z28"          "Fiat 128"            "Mazda RX4 Wag"      
#   [13] "Datsun 710"          "Merc 240D"           "Duster 360"                  


#############################################################################################
filter_ref_table %>% 
  filter(var_name == "gear") %>% 
  unnest(filtering_values) %>%
  unnest(values)

## # A tibble: 1 x 4
##   var_name   min   max values
##   <chr>    <int> <int>  <dbl>
## 1 gear        NA    NA      3 

因此基于 filter_ref_table ,我们知道我们需要像这样过滤 my_mtcars 中的行:

So based on filter_ref_table, we know that we need to filter rows in my_mtcars like that:

expected_output <- 
  my_mtcars %>%
  filter(cars %in% c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE", 
                     "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive", 
                     "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710", 
                     "Merc 240D", "Duster 360")) %>%
  filter(gear == 3) %>%
  filter(between(wt, 3.4, 3.9))

> expected_output

##         cars  mpg cyl disp  hp drat   wt  qsec vs am gear carb
## 1    Valiant 18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
## 2 Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
## 3 Camaro Z28 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4

底线-我的问题是:当只用 filter_ref_table 处理以结尾时,如何以编程方式过滤 my_mtcars Expected_output ?

Bottom line -- my question is: how can we filter my_mtcars programmatically when handed with only filter_ref_table to end up with expected_output?

推荐答案

这是一种可能的解决方案

Here's one possible solution

doFilter <- function(data, criteria) {
  retVal <- data
  for (var in criteria %>% pull(var_name)) {
    crit <- criteria %>% filter(var_name == var) %>% unnest()
    minVal <- crit$min
    maxVal <- crit$max
    values <- crit$values
    if (!is.null(minVal)) {
      if (!is.na(minVal)) retVal <- retVal %>% filter(get(var) >= minVal)
    }
    if (!is.null(maxVal)) {
      if (!is.na(maxVal)) retVal <- retVal %>% filter(get(var) <= maxVal)
    }
    if (!is.null(values[[1]])) {
      if (length(values[[1]]) > 0) retVal <- retVal %>% filter(get(var) %in% values[[1]])
    }
  }
  return(retVal)
}

my_mtcars %>% doFilter(filter_ref_table)

给予

        cars  mpg cyl disp  hp drat   wt  qsec vs am gear carb
1    Valiant 18.1   6  225 105 2.76 3.46 20.22  1  0    3    1
2 Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
3 Camaro Z28 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4

关键是使用 get()将字符列名称转换为对象,从而适合tidyverse的NSE.

The key is the use of get() to convert the character column names to objects and thus fit in with the tidyverse's NSE.

顺便说一句,您使用 NA NULL 和零长度列表表示不做任何事情".有点尴尬.

As an aside, your use of NA, NULL and zero-length lists to represent "do nothing" was slightly awkward.

更正并编辑

我上面的原始代码无法根据 value 进行过滤.解决方法是显而易见且容易的.我很抱歉.

My original code above fails to filter on value. The fix is obvious and easy. My apologies.

要在评论中回答OP的问题并扩展我的最后一句话...

To answer OP's question in the comments and to expand on my final sentence...

如果您的过滤器数据集看起来像这样:

If your filter dataset looked something like this:

carList <- c("Maserati Bora", "Chrysler Imperial", "Toyota Corona", "Merc 450SE", 
  "Lincoln Continental", "Mazda RX4", "Valiant", "Hornet 4 Drive", 
  "Fiat X1-9", "Camaro Z28", "Fiat 128", "Mazda RX4 Wag", "Datsun 710", 
  "Merc 240D", "Duster 360")
anotherFilterTable <- tibble(
  var_name = c("disp", "wt", "gear", "carb", "mpg",        "cars", "drat"),
  value=     c(    NA,   NA,      3,     NA,    NA,            NA,     NA),
  min=       c(    NA,  3.4,     NA,     NA,    NA,            NA,     NA),
  max=       c(    NA,  3.9,     NA,     NA,    NA,            NA,     NA),
  choices=   c(    NA,   NA,     NA,     NA,    NA, list(carList),     NA)
) 

anotherFilterTable
# A tibble: 7 x 5
  var_name value   min   max choices   
  <chr>    <dbl> <dbl> <dbl> <list>    
1 disp        NA  NA    NA   <lgl [1]> 
2 wt          NA   3.4   3.9 <lgl [1]> 
3 gear         3  NA    NA   <lgl [1]> 
4 carb        NA  NA    NA   <lgl [1]> 
5 mpg         NA  NA    NA   <lgl [1]> 
6 cars        NA  NA    NA   <chr [15]>
7 drat        NA  NA    NA   <lgl [1]> 

然后,我们删除了一层嵌套,doFilter函数可以变为(这次过滤 value 以及其他条件)...

Then we've removed one level of nesting and the doFilter function can become (this time filtering on value as well as the other criteria)...

doFilter <- function(data, criteria) {
  retVal <- data
  for (var in criteria %>% pull(var_name)) {
    crit <- criteria %>% filter(var_name == var)
    if (!is.na(crit$value)) retVal <- retVal %>% filter(get(var) == crit$value)
    if (!is.na(crit$min)) retVal <- retVal %>% filter(get(var) >= crit$min)
    if (!is.na(crit$max)) retVal <- retVal %>% filter(get(var) <= crit$max)
    if (!is.na(crit$choices)) {
      retVal <- retVal %>% filter(get(var) %in% crit$choices[[1]])
    }
  }
  return(retVal)
}

这有点短,在我看来,它更易于阅读.

This is slightly shorter and, to my mind, easier to read.

此解决方案和OP的原始问题说明均隐含地假设了一组固定的可能的过滤条件.(OP的问题语句也假定使用固定的列名.)为了提供更大的灵活性-也许允许将不同的条件应用于不同数据集中的同一列,然后类似

Both this solution and the OP's original problem statement implicitly assume a fixed set of possible filtering criteria. (The OP's problem statement also assumes fixed column names.) To allow greater flexibility - perhaps allowing for different criteria to be applied to the same column in different datasets, then something like

anotherFilterTable %>% 
  mutate(across(c(value, min, max), as.list)) %>% 
  pivot_longer(
    cols=c(value, min, max, choices),
    names_to="criterion",
    values_to="value"
  ) %>% 
  add_column(source="my_mtcars")
# A tibble: 28 x 4
   var_name criterion value     source   
   <chr>    <chr>     <list>    <chr>    
 1 disp     value     <dbl [1]> my_mtcars
 2 disp     min       <dbl [1]> my_mtcars
 3 disp     max       <dbl [1]> my_mtcars
 4 disp     choices   <lgl [1]> my_mtcars
 5 wt       value     <dbl [1]> my_mtcars
 6 wt       min       <dbl [1]> my_mtcars
 7 wt       max       <dbl [1]> my_mtcars
 8 wt       choices   <lgl [1]> my_mtcars
 9 gear     value     <dbl [1]> my_mtcars
10 gear     min       <dbl [1]> my_mtcars
# … with 18 more rows

可能会. doFilter()将需要进行相应的修改,或者进行相应的修改.我认为这种格式还将允许定义任意过滤条件(例如,仅 mpg 在 mpg 值"),而无需在每次定义新的电位标准时都修改 doFilter()函数.

might do. doFilter() would need to be modified accordingly, or course. I think this format would also allow the definition of arbitrary filtering criteria (for example "only those rows whose mpg is in the 1st quartile of mpg values") to be specified without the need to modify the doFilter() function every time a new poteintial criterion was defined.

一如既往,这是灵活性和复杂性之间的权衡.OP将需要确定最佳位置.

As always, it's a trade-off between flexibility and complexity. The OP will need to decide where the optimum lies.

这篇关于根据单独的参考表过滤数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆