Regex comma use in data cleaning with R

Question

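In one of my previous questions (Creating adjacency matrix with dirty dataset), I was able to get almost all of my data cleaned. Thank you, you awesome coders. However, while trying to understand how it all works, I keep running into a problem with commas.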

The dataset originally looks like this:

Species    Association                  Year
<fctr>     <chr>                        <dbl>
1   RC     SKS/BW                       NA  
2   BW     Sykes, rc                    NA
3   SKS    Babo/bw                      NA
4   RC     baboon, mangabey             NA
5   Mang   red colobus, bw, sykes       NA
6   SKS    babo/red duiker              NA
11  BW     r/c monkeys                  12
21  RC     b/w colobus                  12
31  SKS    b/w colobus/R/c monkeys      12
41  BW     sykes/R/c monkeys            12
51  RC     sykes/b/w colobus            12
61  BABO   -                            12
7   SKS    -                            12
8   RC     -                            12
9   SKS    r/c monkeys                  12
10  RC     sykes monkeys                12
53  BW     sykes,b/w colobus            12
57  BW     r/c monkeys,bw               12
58  Mang   sykes,R/c monkeys            12

dput output:

dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS", 
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC", "BW", "BW", "Mang"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey", 
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus", 
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus", 
".", ".", ".", "r/c monkeys", "sykes monkeys", "sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys"), year = c(NA, NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", "2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", "7", "8", "9", "10", "53", "57", "58"), class = "data.frame")

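To clean it up, I created a dictionary and then used a regex to capture every variation in the Association column except those in the last three rows, because they are separated by ',' rather than '/'.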

dict <- read.table(header=TRUE, text='
from  to
"BABO"  BABO
"yellow baboon"  BABO
"BW"  BW
"bw colobus" BW
"Bw" BW
"bw" BW
"Bw colobus" BW
"B/W COLOBUS" BW
"RC"  RC
"RED COLOBUS"  RC
"rc monkeys" RC
"Red colobus" RC
"R/C MONKEYS" RC
"Rc monkeys" RC
"MANGABEY"  MANG
"MANGA" MANG
"mangabeys" MANG
"SKS"  SKS
"SYKES"  SKS
"SYKES MONKEYS" SKS
"sykes" SKS
"SYKES MONKEY" SKS
"RED DUIKER"  RD
"Red duiker" RD
"Red Duiker + V . Fresh dung" RD
')

regex <- '(?<=\\w{2})\\/|,\\s'

spf <- "%s"

data.frame(from=
             sprintf(spf, 
                     sort(unique(unlist(
                       strsplit(toupper(dat$Association), regex, perl=TRUE)))))) |> 
                       print(row.names=FALSE)

res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
  lapply(\(x) dict[match(x, dict$from), ]$to) |>
  sapply(toString) |>
  {\(.) replace(., . == ".", NA)}() |>
  data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
  setNames(c('association', 'site', 'species', 'year')) |>
  subset(select=c(3, 1, 2, 4))

This gives me a final data frame:

Species    Association       Site           Year
<fctr>     <chr>             <chr>          <dbl>
1   RC     SKS, BW           Protected      NA
2   BW     SKS, RC           Protected      NA
3   SKS    BABO, BW          Protected      NA
4   RC     BABO, MANG        Protected      NA
5   MANG   RC, BW, SKS       Protected      NA
6   SKS    BABO, RD          Protected      NA
7   BW     RC                Protected      12
8   RC     BW                Protected      12
9   SKS    BW, RC            Protected      12
10  BW     SKS, RC           Protected      12
11  RC     SKS, BW           Protected      12
12  BABO   NA                Protected      12
13  SKS    NA                Protected      12
14  RC     NA                Protected      12
15  SKS    RC                Protected      12
16  RC     SKS               Protected      12
17  BW     NA                Protected      12
18  BW     NA                Protected      12
19  MANG   NA                Protected      12
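
I would like the last three rows to show the correct associations as well (i.e. SKS, BW; RC, BW; SKS, RC), but everything I read about regex uses the comma as part of the expression syntax rather than as a character to be found in the string. Is there a way to include it so that I get the correct output? I am still new to regex and new to R. Any help is much appreciated.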
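
The `,\s` branch of the pattern above only matches a comma that is followed by whitespace, so a value such as "sykes,b/w colobus" is never split and the whole string fails the dictionary lookup. The following is a minimal sketch of one possible adjustment, assuming the only change wanted is to make that whitespace optional. The name `regex_loose` and the `,\s*` variant are illustrative assumptions, not part of the original code.

```r
# Same pattern as in the question, except that the whitespace after the
# comma is optional (\\s* instead of \\s). This variant is an assumption.
regex_loose <- '(?<=\\w{2})/|,\\s*'

# The three inputs that came back as NA, plus one example of each other style.
x <- toupper(c("sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys",
               "Sykes, rc", "SKS/BW"))

# A literal comma needs no escaping; the lookbehind still keeps the slash
# inside "B/W" and "R/C" from being treated as a separator.
parts <- strsplit(x, regex_loose, perl = TRUE)
print(parts)
```

The resulting tokens ("SYKES", "B/W COLOBUS", "R/C MONKEYS", "BW") all appear in the `from` column of the dictionary shown above, so the existing `match()` step should be able to resolve them.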

Recommended answer

The problem is in your dictionary. Use the tidyverse as follows:

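library(tidyverse)
# Extend the dictionary, then sort so the longest patterns come first and
# multi-word entries are replaced before their shorter substrings.
dict1 <- dict %>%
  add_row(from = 'BABOON', to = 'BABO') %>%
  add_row(from = '.', to = NA) %>%
  add_row(from = '/', to = ',') %>%
  mutate(from = toupper(from)) %>%
  distinct() %>%
  arrange(desc(nchar(from)))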

dat %>%
  mutate(Association = str_replace_all(toupper(Association), 
                              fixed(setNames(dict1$to, dict1$from))),
         Site = 'Protected')


  Species Association year      Site
1       RC      SKS,BW   NA Protected
2       BW     SKS, RC   NA Protected
3      SKS     BABO,BW   NA Protected
4       RC  BABO, MANG   NA Protected
5     Mang RC, BW, SKS   NA Protected
6      SKS     BABO,RD   NA Protected
11      BW          RC   12 Protected
21      RC          BW   12 Protected
31     SKS       BW,RC   12 Protected
41      BW      SKS,RC   12 Protected
51      RC      SKS,BW   12 Protected
61    BABO        <NA>   12 Protected
7      SKS        <NA>   12 Protected
8       RC        <NA>   12 Protected
9      SKS          RC   12 Protected
10      RC SKS MONKEYS   12 Protected
53      BW      SKS,BW   12 Protected
57      BW       RC,BW   12 Protected
58    Mang      SKS,RC   12 Protected
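
The `arrange(desc(nchar(from)))` step matters because the replacements are applied one after another. A minimal base R sketch of why the order makes a difference is given here. The cut-down `lookup` table and the helper `replace_in_order` are hypothetical names introduced for illustration, and the loop is a stand-in for the idea, not a description of stringr's internals.

```r
# Hypothetical, cut-down lookup table (not the full dictionary from the question).
lookup <- data.frame(
  from = c("/", "SYKES", "B/W COLOBUS"),
  to   = c(",", "SKS", "BW")
)

# Apply each fixed-string replacement in turn, in the order given.
replace_in_order <- function(x, from, to) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

x <- toupper("sykes/b/w colobus")

# Short patterns first: "/" is rewritten before "B/W COLOBUS" can match.
unsorted <- replace_in_order(x, lookup$from, lookup$to)

# Longest patterns first, mirroring arrange(desc(nchar(from))).
ord <- order(nchar(lookup$from), decreasing = TRUE)
sorted <- replace_in_order(x, lookup$from[ord], lookup$to[ord])

print(unsorted)  # "SKS,B,W COLOBUS"
print(sorted)    # "SKS,BW"
```

With the short pattern applied first, the slash inside "B/W COLOBUS" is turned into a comma and the longer entry can no longer match, which is the failure the sort is meant to avoid.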
