Create new column from an existing column with pattern matching in R


Problem description

I'm trying to add a new column based on another using pattern matching. I've read this post, but I'm not getting the desired output.

I want to create a new column (SubOrder) based on the GreatGroup column. I have tried the following:

SubOrder <- rep(NA_character_, length(myData))

SubOrder[grepl("udults", myData, ignore.case = TRUE)] <-  "Udults"
SubOrder[grepl("aquults", myData, ignore.case = TRUE)] <-  "Aquults"
SubOrder[grepl("aqualfs", myData, ignore.case = TRUE)] <-  "aqualfs"
SubOrder[grepl("humods", myData, ignore.case = TRUE)] <-  "humods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("orthods", myData, ignore.case = TRUE)] <-  "orthods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("psamments", myData, ignore.case = TRUE)] <-  "psamments"
SubOrder[grepl("udepts", myData, ignore.case = TRUE)] <-  "udepts"
SubOrder[grepl("fluvents", myData, ignore.case = TRUE)] <-  "fluvents"
SubOrder[grepl("aquods", myData, ignore.case = TRUE)] <-  "aquods"

For example, I'm looking for "udults" inside any word, such as Hapludults or Paleudults, and want to return just "udults".

EDIT: If anyone wants to take a shot at alistaire's comment, these are the search patterns I would use.

 subOrderNames <- c("Udults", "Aquults", "Aqualfs", "Humods", "Udalfs", "Orthods", "Psamments", "Udepts", "fluvents")

Example data below.

myData <- dput(head(test))
structure(list(1:6, SID = c(200502L, 200502L, 200502L, 200502L, 
200502L, 200502L), Groupdepth = c(11L, 12L, 13L, 14L, 21L, 22L
), AWC0to10 = c(0.12, 0.12, 0.12, 0.12, 0.12, 0.12), AWC10to20 = c(0.12, 
0.12, 0.12, 0.12, 0.12, 0.12), AWC20to50 = c(0.12, 0.12, 0.12, 
0.12, 0.12, 0.12), AWC50to100 = c(0.15, 0.15, 0.15, 0.15, 0.15, 
0.15), Db3rdbar0to10 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), 
    Db3rdbar10to20 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar20to50 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar50to100 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), HydrcRatngPP = c(0L, 0L, 0L, 
    0L, 0L, 0L), OrgMatter0to10 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter10to20 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter20to50 = c(1.02, 1.02, 1.02, 1.02, 1.02, 
    1.02), OrgMatter50to100 = c(0.12, 0.12, 0.12, 0.12, 0.12, 
    0.12), Clay0to10 = c(8, 8, 8, 8, 8, 8), Clay10to20 = c(8, 
    8, 8, 8, 8, 8), Clay20to50 = c(9.4, 9.4, 9.4, 9.4, 9.4, 9.4
    ), Clay50to100 = c(40, 40, 40, 40, 40, 40), Sand0to10 = c(85, 
    85, 85, 85, 85, 85), Sand10to20 = c(85, 85, 85, 85, 85, 85
    ), Sand20to50 = c(83, 83, 83, 83, 83, 83), Sand50to100 = c(45.8, 
    45.8, 45.8, 45.8, 45.8, 45.8), pHwater0to20 = c(6.3, 6.3, 
    6.3, 6.3, 6.3, 6.3), Ksat0to10 = c(23, 23, 23, 23, 23, 23
    ), Ksat10to20 = c(23, 23, 23, 23, 23, 23), Ksat20to50 = c(19.7333, 
    19.7333, 19.7333, 19.7333, 19.7333, 19.7333), Ksat50to100 = c(9, 
    9, 9, 9, 9, 9), TaxClName = c("Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults"), GreatGroup = c("Hapludults", 
    "Hapludults", "Hapludults", "Hapludults", "Hapludults", "Hapludults"
    )), .Names = c("", "SID", "Groupdepth", "AWC0to10", "AWC10to20", 
"AWC20to50", "AWC50to100", "Db3rdbar0to10", "Db3rdbar10to20", 
"Db3rdbar20to50", "Db3rdbar50to100", "HydrcRatngPP", "OrgMatter0to10", 
"OrgMatter10to20", "OrgMatter20to50", "OrgMatter50to100", "Clay0to10", 
"Clay10to20", "Clay20to50", "Clay50to100", "Sand0to10", "Sand10to20", 
"Sand20to50", "Sand50to100", "pHwater0to20", "Ksat0to10", "Ksat10to20", 
"Ksat20to50", "Ksat50to100", "TaxClName", "GreatGroup"), class = c("tbl_df", 
"data.frame"), row.names = c(NA, -6L))

Solution

A few options, some of which I posted in the comments above.

Note: All options assume the replacement for a string that matches a pattern is just the pattern itself. If you want something else, they're all easily editable to include separate replacement values.
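As a rough illustration of what "separate replacement values" could look like, here is a minimal sketch that keeps the patterns and their labels together in a named character vector. The names lookup, great_group and sub_order are made up for this example, and the toy vector stands in for myData$GreatGroup.

```r
# Hypothetical lookup: names are the patterns to search for,
# values are the labels to write into the new column.
lookup <- c(udults = "Udults", aquults = "Aquults", aqualfs = "Aqualfs")

# Toy input standing in for myData$GreatGroup (assumed values)
great_group <- c("Hapludults", "Endoaquults", "Epiaqualfs", "Quartzipsamments")

# Start with NA everywhere, then fill in the label wherever a pattern matches
sub_order <- rep(NA_character_, length(great_group))
for (p in names(lookup)) {
  hit <- grepl(p, great_group, ignore.case = TRUE)
  sub_order[hit] <- lookup[[p]]
}

print(sub_order)
```

The last element stays NA because none of the three patterns occurs in it.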

Option 1: for + grepl

Using the same code as the original, but looping to avoid repetitive code:

# make a vector of patterns
pat <- c('udults', 'aquults', 'aqualfs', 'humods', 'udalfs', 'orthods', 'psamments', 'udepts', 'fluvents', 'aquods')

# one NA per row of the data (nrow, not length, which counts columns)
SubOrder <- rep(NA_character_, nrow(myData))

# search the GreatGroup column, not the whole data frame
for(x in seq_along(pat)){
  SubOrder[grepl(pat[x], myData$GreatGroup, ignore.case = TRUE)] <-  pat[x]
}
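Two details matter in a loop like this. For a data frame, length() counts columns while nrow() counts rows, so the result vector should have one element per row. And grepl needs the character column itself rather than the whole data frame. A small sketch on a toy data frame (toy and its values are invented for illustration, not taken from the question's data):

```r
# Toy data frame standing in for myData (hypothetical values)
toy <- data.frame(
  SID = c(1L, 2L, 3L, 4L),
  GreatGroup = c("Hapludults", "Paleudults", "Haplorthods", "Udifluvents"),
  stringsAsFactors = FALSE
)

n_cols <- length(toy)  # number of columns
n_rows <- nrow(toy)    # number of rows

pat <- c("udults", "orthods", "aquods")

# One NA per row, then fill in wherever a pattern matches the column
toy$SubOrder <- rep(NA_character_, n_rows)
for (p in pat) {
  toy$SubOrder[grepl(p, toy$GreatGroup, ignore.case = TRUE)] <- p
}

print(toy)
```

Assigning into toy$SubOrder directly also attaches the result to the data frame, which a free-standing SubOrder vector does not do.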


Option 2: for + gsub

Build the new column in place by copying myData$GreatGroup and then altering it with gsub. The .* pasted onto each side of the pattern matches the rest of the string, so the whole value is replaced by the pattern rather than just the matching part.

myData$SubOrder <- myData$GreatGroup
for(x in pat){
  myData$SubOrder <- gsub(paste0('.*', x, '.*'), x, myData$SubOrder, ignore.case = TRUE)
}

Note that values not matched by one of the strings in pat will have the value from GreatGroup, not NA. If you want them to be NA, fix them with

myData$SubOrder[!(myData$SubOrder %in% pat)] <- NA
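To see both steps in action, here is a minimal sketch on a toy character vector. The names great_group and sub_order and the sample values are assumptions made for the example.

```r
pat <- c("udults", "aquults", "orthods")

# Toy input standing in for myData$GreatGroup (assumed values)
great_group <- c("Hapludults", "Paleudults", "Endoaquults", "Dystrudepts")

# Step 1: replace every value that contains a pattern with the pattern itself
sub_order <- great_group
for (p in pat) {
  sub_order <- gsub(paste0(".*", p, ".*"), p, sub_order, ignore.case = TRUE)
}
print(sub_order)  # the unmatched "Dystrudepts" is still present here

# Step 2: anything that is not exactly one of the patterns becomes NA
sub_order[!(sub_order %in% pat)] <- NA
print(sub_order)
```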


Option 3: named list + stringr::str_replace_all

My favorite because it doesn't loop, although it requires the stringr package (which is pretty awesome, anyway).

Make a named list from pat, where the name is the regex you want to replace, and the item is the string to replace it with:

l <- as.list(pat)
names(l) <- paste0('.*', pat, '.*')

so it looks like

> l
$`.*udults.*`
[1] "udults"

$`.*aquults.*`
[1] "aquults"

$`.*aqualfs.*`
[1] "aqualfs"
......

Then use str_replace_all to DO IT ALL AT ONCE:

library(stringr)
myData$SubOrder <- str_replace_all(myData$GreatGroup, l)

Boom.

Note 1: str_replace_all doesn't have an ignore.case option, but you can wrap myData$GreatGroup in tolower (easy) or reconfigure the regex (hard).

Note 2: Like Option 2, it leaves unmatched entries as the value from GreatGroup, so use the line at the end of that option to go back to NAs, if you like.
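Both notes can be folded into one short sketch. It uses a named character vector, which is the form the stringr documentation describes for multiple replacements, and the inline (?i) flag to make each regex case-insensitive. The names repl, great_group and sub_order and the sample values are assumptions for illustration, and the stringr package must be installed.

```r
library(stringr)

pat <- c("udults", "aquults", "orthods")

# Toy input standing in for myData$GreatGroup (assumed values)
great_group <- c("Hapludults", "PALEUDULTS", "Endoaquults", "Dystrudepts")

# Names are the regexes ("(?i)" switches on case-insensitive matching),
# values are the replacement strings.
repl <- setNames(pat, paste0("(?i).*", pat, ".*"))

sub_order <- str_replace_all(great_group, repl)

# Unmatched entries keep their original value, so turn them into NA
sub_order[!(sub_order %in% pat)] <- NA

print(sub_order)
```

The replacements are applied one pattern after another, so this relies on no replacement value itself matching a later pattern, which holds for the patterns used here.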
