使用R清除数据时使用正则表达式逗号 [英] Regex comma use in data cleaning with R
本文介绍了使用R清除数据时使用正则表达式逗号的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
在我之前的一个问题(Creating adjacency matrix with dirty dataset)中,我能够清除几乎所有的数据。谢谢你们,你们这些出色的程序员。然而,当我试图了解游乐场如何工作时,我继续遇到逗号问题。
数据集最初看起来像-
Species Association Year
<fctr> <chr> <dbl>
1 RC SKS/BW NA
2 BW Sykes, rc NA
3 SKS Babo/bw NA
4 RC baboon, mangabey NA
5 Mang red colobus, bw, sykes NA
6 SKS babo/red duiker NA
11 BW r/c monkeys 12
21 RC b/w colobus 12
31 SKS b/w colobus/R/c monkeys 12
41 BW sykes/R/c monkeys 12
51 RC sykes/b/w colobus 12
61 BABO - 12
7 SKS - 12
8 RC - 12
9 SKS r/c monkeys 12
10 RC sykes monkeys 12
53 BW sykes,b/w colobus 12
57 BW r/c monkeys,bw 12
58 Mang sykes,R/c monkeys 12
Dput-
dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS",
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC", "BW", "BW", "Mang"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey",
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus",
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus",
".", ".", ".", "r/c monkeys", "sykes monkeys", "sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys"), year = c(NA, NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", "2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", "7", "8", "9", "10", "53", "57", "58"), class = "data.frame")
为了进行清理,我创建了一个字典,然后使用正则表达式捕获关联列中除最后三行之外的所有变化,因为它们是用‘,’而不是‘/’分隔的
dict <- read.table(header=TRUE, text='
from to
"BABO" BABO
"yellow baboon" BABO
"BW" BW
"bw colobus" BW
"Bw" BW
"bw" BW
"Bw colobus" BW
"B/W COLOBUS" BW
"RC" RC
"RED COLOBUS" RC
"rc monkeys" RC
"Red colobus" RC
"R/C MONKEYS" RC
"Rc monkeys" RC
"MANGABEY" MANG
"MANGA" MANG
"mangabeys" MANG
"SKS" SKS
"SYKES" SKS
"SYKES MONKEYS" SKS
"sykes" SKS
"SYKES MONKEY" SKS
"RED DUIKER" RD
"Red duiker" RD
"Red Duiker + V . Fresh dung" RD
')
regex <- '(?<=\w{2})\/|,\s'
spf <- "%s"
data.frame(from=
sprintf(spf,
sort(unique(unlist(
strsplit(toupper(dat$Association), regex, perl=TRUE)))))) |>
print(row.names=FALSE)
res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
lapply((x) dict[match(x, dict$from), ]$to) |>
sapply(toString) |>
{(.) replace(., . == ".", NA)}() |>
data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
setNames(c('association', 'site', 'species', 'year')) |>
subset(select=c(3, 1, 2, 4))
给我一个最终数据框-
Species Association Site Year
<fctr> <chr> <chr> <dbl>
1 RC SKS, BW Protected NA
2 BW SKS, RC Protected NA
3 SKS BABO, BW Protected NA
4 RC BABO, MANG Protected NA
5 MANG RC, BW, SKS Protected NA
6 SKS BABO, RD Protected NA
7 BW RC Protected 12
8 RC BW Protected 12
9 SKS BW, RC Protected 12
10 BW SKS, RC Protected 12
11 RC SKS, BW Protected 12
12 BABO NA Protected 12
13 SKS NA Protected 12
14 RC NA Protected 12
15 SKS RC Protected 12
16 RC SKS Protected 12
17 BW NA Protected 12
18 BW NA Protected 12
19 MANG NA Protected 12
我希望包括最后三行以读取正确的关联(即SKS,BW;RC,BW;SKS,RC),但我正在阅读的有关regex的所有内容都将逗号用作表达式的一部分,而不是字符串中找到的内容的一部分。有没有办法把它包括进去,这样它就会给出正确的输出?我仍然是regex的新手,也是R的新手。非常感谢您的帮助。
推荐答案
问题出在您的词典上。使用tidyverse,如下所示:
library(tidyverse)
dict1 <- dict %>%
add_row(from = 'BABOON', to = 'BABO') %>%
add_row(from='.', to = NA) %>%
add_row(from = '/', to = ',')
mutate(from = toupper(from))%>%
distinct() %>%
arrange(desc(nchar(from)))
dat %>%
mutate(Association = str_replace_all(toupper(Association),
fixed(setNames(dict1$to, dict1$from))),
Site = 'Protected')
Species Association year Site
1 RC SKS,BW NA Protected
2 BW SKS, RC NA Protected
3 SKS BABO,BW NA Protected
4 RC BABO, MANG NA Protected
5 Mang RC, BW, SKS NA Protected
6 SKS BABO,RD NA Protected
11 BW RC 12 Protected
21 RC BW 12 Protected
31 SKS BW,RC 12 Protected
41 BW SKS,RC 12 Protected
51 RC SKS,BW 12 Protected
61 BABO <NA> 12 Protected
7 SKS <NA> 12 Protected
8 RC <NA> 12 Protected
9 SKS RC 12 Protected
10 RC SKS MONKEYS 12 Protected
53 BW SKS,BW 12 Protected
57 BW RC,BW 12 Protected
58 Mang SKS,RC 12 Protected
这篇关于使用R清除数据时使用正则表达式逗号的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文