Transmute new columns based on exact match of multiple words in string


Problem description

I have a data frame:

df <- data.frame(
  Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
  Dominantspp = c("OM", "OM", "RSS", "CH"),
  Commonspp = c(" ", " ", " ", "OM"),
  Rarespp = c(" ", " ", "SD", "NP"),
  NP = rep("northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM", 4),
  OM = rep("steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM", 4),
  RSS = rep("redside shiner|REDSIDE SHINER|rs|RS|rss|RSS", 4),
  suck = rep("suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS", 4)
) 

I need to use the columns populated with common fish codes/names (NP, OM, RSS, suck) as patterns to evaluate the first four columns, and output a 1/0 for each of those pattern columns depending on whether the pattern is matched EXACTLY. The code below matches partial words rather than full words, so it produces incorrect data (see the resulting tibble below).

df %>%
  rowwise() %>%
  transmute_at(vars(NP, OM, RSS, suck), 
               funs(case_when(
                 grepl(., Dominantspp) ~ "1",
                 grepl(., Commonspp) ~ "1",
                 grepl(., Rarespp) ~ "1",
                 grepl(., Otherspp) ~ "1",
                 TRUE ~ "0"))) %>%
  ungroup()

Result: note that in row three, both "suck" and "RSS" receive a "1".

# A tibble: 4 x 4
     NP    OM   RSS  suck
  <chr> <chr> <chr> <chr>
1     0     1     0     1
2     0     1     0     0
3     0     0     1     1
4     1     1     1     1

Desired output:

  NP OM RSS suck
1  0  1   0    1
2  0  1   0    0
3  0  0   1    0
4  1  1   1    0

Recommended answer

The fastest way to solve your problem while keeping your approach is to wrap each regex in a group and add word boundaries, \\b, at its beginning and end:

df <- data.frame(
  Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
  Dominantspp = c("OM", "OM", "RSS", "CH"),
  Commonspp = c(" ", " ", " ", "OM"),
  Rarespp = c(" ", " ", "SD", "NP"),
  NP = rep("\\b(northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM)\\b", 4),
  OM = rep("\\b(steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM)\\b", 4),
  RSS = rep("\\b(redside shiner|REDSIDE SHINER|rs|RS|rss|RSS)\\b", 4),
  suck = rep("\\b(suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS)\\b", 4),
  stringsAsFactors = FALSE
)

This makes the regular expressions match only full words, which will make the rest of your code work as intended.
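To see why the boundaries matter, here is a minimal base R sketch using the `suck` pattern and the `Otherspp` values from the question. The variable names `plain`, `bounded`, and `values` are only illustrative. Note that the parentheses are needed: without them, `\\b` would bind only to the first and last alternatives rather than to the whole list.

```r
# Pattern from the question's `suck` column, without word boundaries.
plain <- "suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS"

# Same alternatives, grouped and wrapped in \\b on both sides.
bounded <- paste0("\\b(", plain, ")\\b")

# The Otherspp column from the question.
values <- c("suck SD", "BT", "SD RS", "RSS")

# Unbounded: "RSS" is a false positive because it contains "SS".
grepl(plain, values)
# [1]  TRUE FALSE FALSE  TRUE

# Bounded: only the whole word "suck" matches.
grepl(bounded, values)
# [1]  TRUE FALSE FALSE FALSE
```

The same false positive explains the question's row three, where `Dominantspp` is "RSS" and the unbounded `suck` pattern matches its "SS" substring.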

Having said that, I don't think this is necessarily the best way to approach the problem (rowwise() is rarely recommended today, and this approach won't scale well to many fish codes). I think you'd have an easier time working with this data if you reshaped it into a tidy format, with one row per combination of original row and code:

library(dplyr)
library(tidyr)
library(tidytext)

row_codes <- df %>%
  select(Otherspp:Rarespp) %>%
  mutate(row = row_number()) %>%
  gather(type, codes, -row) %>%
  unnest_tokens(code, codes, token = "regex", pattern = " ")

This results in:

   row        type code
1    1 Dominantspp   om
2    1    Otherspp suck
3    1    Otherspp   sd
4    2 Dominantspp   om
5    2    Otherspp   bt
6    3 Dominantspp  rss
7    3    Otherspp   sd
8    3    Otherspp   rs
9    3     Rarespp   sd
10   4   Commonspp   om
11   4 Dominantspp   ch
12   4    Otherspp  rss
13   4     Rarespp   np

At this point, the codes are much easier to work with (you don't need regular expressions anymore). For example, you could inner_join it to a table of the fish codes.
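As a rough sketch of that join, the example below rebuilds `row_codes` by hand from the output printed above (so it runs without tidytext) and joins it to a lookup table. The table `fish_codes`, its columns `code` and `species`, and the result name `presence` are assumptions made for illustration, not part of the original answer; the lookup lists only a few of the spellings from the question. `pivot_wider()` requires tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# row_codes as printed above, rebuilt by hand so the sketch is self-contained.
row_codes <- data.frame(
  row  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  code = c("om", "suck", "sd", "om", "bt", "rss", "sd",
           "rs", "sd", "om", "ch", "rss", "np"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per accepted (lowercased) spelling.
fish_codes <- data.frame(
  code    = c("np", "npm", "om", "st", "rb", "rss", "rs", "suck", "su", "ss"),
  species = c("NP", "NP", "OM", "OM", "OM", "RSS", "RSS", "suck", "suck", "suck"),
  stringsAsFactors = FALSE
)

presence <- row_codes %>%
  inner_join(fish_codes, by = "code") %>%  # keep only recognised codes
  distinct(row, species) %>%               # one flag per row and species
  mutate(present = 1L) %>%
  pivot_wider(names_from = species, values_from = present,
              values_fill = list(present = 0L)) %>%  # absent species become 0
  arrange(row) %>%
  select(row, NP, OM, RSS, suck)           # column order of the desired output

presence
```

Because the join compares whole tokens, "rss" can never be mistaken for "ss", so no word-boundary handling is needed. With the sample data, the four flag columns agree with the desired output in the question.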
