Create new column in dataframe based on partial string matching other column
Question
I have a dataframe with two columns, GL and GLDESC, and want to add a third column called KIND based on the contents of GLDESC.
The dataframe looks like this:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
For each row of the dataframe:
- If GLDESC contains the word Payroll anywhere in the string, I want KIND to be Payroll.
- If GLDESC contains the word Gas anywhere in the string, I want KIND to be Materials.
- In all other cases, I want KIND to be Other.
I looked for similar examples on Stack Overflow but could not find any. I also looked in R for Dummies at switch, grep, apply and regular expressions, trying to match only part of the GLDESC column and then fill the KIND column with the kind of account, but was unable to make it work.
Recommended answer
Since you have only two conditions, you can use a nested ifelse:
# random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10),
                 GLDESC = paste(sample(letters, 10),
                                c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
                                  "asdfg", "GAS--2", "fghfgh", "qweee"),
                                sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials",
           ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
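The same nested ifelse can be tried on the data from the question. The sketch below retypes the seven rows by hand (so the exact values are an assumption), uses the made-up name DF2 to avoid clashing with DF above, and matches the literal, case-sensitive strings with fixed = TRUE:

```r
# The question's data, retyped by hand (an assumption about the exact values)
DF2 <- data.frame(
  GL = c(515100, 515900, 532300, 539991, 551000, 551100, 551300),
  GLDESC = c("Payroll-Indir Salary Labor",
             "Payroll-Indir Compensated Absences",
             "Bulk Gas",
             "Area Charge In",
             "Repairs & Maint-Spare Parts",
             "Supplies-Operating",
             "Consumables"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "Payroll" and "Gas" as literal, case-sensitive strings
DF2$KIND <- ifelse(grepl("Payroll", DF2$GLDESC, fixed = TRUE), "Payroll",
            ifelse(grepl("Gas", DF2$GLDESC, fixed = TRUE), "Materials", "Other"))
DF2$KIND
# [1] "Payroll"   "Payroll"   "Materials" "Other"     "Other"     "Other"     "Other"
```

If the descriptions may vary in case, ignore.case = TRUE (without fixed = TRUE) behaves as in the random-data example above.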
Edit, October 3, 2016 (..after receiving more attention than expected)
A possible solution for handling more patterns is to iterate over the patterns and, whenever there is a match, progressively reduce the number of comparisons, so that each element takes the replacement of the first pattern it matches:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))

    ans = rep_len(as.character(fill), length(x))  # start with every element set to 'fill'
    empty = seq_along(x)                          # indices that have not matched yet

    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)  # test only the unmatched elements
        ans[empty[greps]] = replacements[[i]]
        empty = empty[!greps]                        # drop the newly matched indices
    }

    return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
   c("pat1a|pat1b", "pat2", "pat3"),
   c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
   c("pat2", "pat1a|pat1b", "pat3"),
   c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
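Note that the patterns used so far match substrings, so "gas" also matches GaSer. If only the standalone word should count, as the wording of the question suggests, one option is to wrap the pattern in word boundaries. A minimal sketch, assuming Perl-compatible regular expressions (perl = TRUE); the sample strings are made up for illustration:

```r
# Made-up sample strings (not from the question)
x <- c("Bulk Gas", "GaSer", "GAS--2", "gasoline")

# \\b marks a word boundary, so "gas" must stand alone as a word
is_gas <- grepl("\\bgas\\b", x, ignore.case = TRUE, perl = TRUE)
is_gas
# [1]  TRUE FALSE  TRUE FALSE
```

Since ff passes its extra arguments on to grepl, the same pattern and perl = TRUE could be supplied there as well.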