Create new column in dataframe based on partial string matching other column

Question

I have a dataframe with 2 columns, GL and GLDESC, and want to add a 3rd column called KIND based on some of the data inside column GLDESC.

The dataframe is as follows:

      GL                             GLDESC
1 515100         Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300                           Bulk Gas
4 539991                     Area Charge In
5 551000        Repairs & Maint-Spare Parts
6 551100                 Supplies-Operating
7 551300                        Consumables

For each row of the data table:

  • If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
  • If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
  • In all other cases I want KIND to be Other

I looked for similar examples on Stack Overflow but could not find any. I also looked in R for Dummies at switch, grep, apply and regular expressions, trying to match only part of the GLDESC column and then fill the KIND column with the kind of account, but was unable to make it work.

Recommended answer

Since you have only two conditions, you can use a nested ifelse:

#random data; it wasn't easy to copy-paste yours  
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10), 
  c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12", 
     "asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))

DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials", 
         ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

DF
#   GL         GLDESC      KIND
#1   8        e gas l Materials
#2   1  c payroll12 y   Payroll
#3  10      m GaSer v Materials
#4   6       t asdf n     Other
#5   2      w qweaa t     Other
#6   4 r PayROll-12 q   Payroll
#7   9      n asdfg a     Other
#8   5     d GAS--2 w Materials
#9   7     s fghfgh e     Other
#10  3      g qweee k     Other
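
The answer uses random data because the question's table was hard to copy. As a minimal sketch, the same nested ifelse can be applied to the table from the question. The data frame name `gl` is an assumption made here for illustration, and the rows are retyped by hand from the question.

```r
# Data frame retyped by hand from the table in the question
# (the name `gl` is hypothetical, chosen only for this sketch)
gl <- data.frame(
  GL = c(515100, 515900, 532300, 539991, 551000, 551100, 551300),
  GLDESC = c("Payroll-Indir Salary Labor",
             "Payroll-Indir Compensated Absences",
             "Bulk Gas",
             "Area Charge In",
             "Repairs & Maint-Spare Parts",
             "Supplies-Operating",
             "Consumables"),
  stringsAsFactors = FALSE
)

# Nested ifelse in the same order as the answer: "gas" is tested first,
# then "payroll", and everything else falls through to "Other"
gl$KIND <- ifelse(grepl("gas", gl$GLDESC, ignore.case = TRUE), "Materials",
           ifelse(grepl("payroll", gl$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

gl$KIND
# [1] "Payroll"   "Payroll"   "Materials" "Other"     "Other"     "Other"     "Other"
```

Because grepl matches substrings, any description that merely contains the letters "gas" would also be labelled Materials; none of the seven rows here is affected by that.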

Edit, 3 October 2016 (..this received more attention than expected)

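A possible solution to deal with more patterns is to iterate over all of them and, whenever there is a match, progressively reduce the number of comparisons. Each element takes the replacement of the first pattern that matches it and is then dropped from later checks, so the order of the patterns matters: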

ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))

    ans = rep_len(as.character(fill), length(x))    
    empty = seq_along(x)

    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]  
        empty = empty[!greps]
    }

    return(ans)
}

ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll"   "Materials" "Other"     "Other"     "Payroll"   "Other"     "Materials" "Other"     "Other"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat1a|pat1b", "pat2", "pat3"), 
   c("1", "2", "3"), fill = "empty")
#[1] "1"     "1"     "3"     "empty"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat2", "pat1a|pat1b", "pat3"), 
   c("2", "1", "3"), fill = "empty")
#[1] "2"     "1"     "3"     "empty"
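
The question asks for the word Gas, while the patterns above match any substring, so a value such as "GaSer" is also labelled Materials. If whole-word matching is wanted, one option is to wrap each pattern in `\\b` word boundaries. The sketch below uses `perl = TRUE`, and the sample strings are a mix of values from the question and from the answer's random data; whether whole-word matching is the desired behaviour is an assumption.

```r
# Sample descriptions: two from the question, three from the answer's random data
x <- c("Bulk Gas", "GaSer", "GAS--2", "Payroll-Indir Salary Labor", "payroll12")

# \\b marks a boundary between a word character and a non-word character,
# so "GaSer" and "payroll12" no longer match
is_gas     <- grepl("\\bgas\\b",     x, ignore.case = TRUE, perl = TRUE)
is_payroll <- grepl("\\bpayroll\\b", x, ignore.case = TRUE, perl = TRUE)

kind <- ifelse(is_gas, "Materials", ifelse(is_payroll, "Payroll", "Other"))
kind
# [1] "Materials" "Other"     "Materials" "Payroll"   "Other"
```

Since ff forwards extra arguments to grepl, the same patterns could be passed to it as `ff(x, c("\\bgas\\b", "\\bpayroll\\b"), c("Materials", "Payroll"), "Other", ignore.case = TRUE, perl = TRUE)`.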
