Replace text using values from a lookup table without a for loop
Problem description
I'm writing a function for spelling correction. I scraped the spelling variants page from Wikipedia and converted it into a table. I now want to use this as a lookup table (spellings) and replace values in my documents (skills.db). NOTE: the skills data frame below is just an example; ignore the second column. I will be performing the spelling correction much earlier in the process, on resumes. Resumes are large, so I thought I'd share this instead.
I can do this using a for loop as below, but I'm wondering if there's a better solution:
spellings = structure(list(preferred_spellings = c("organisation", "acknowledgement",
"cypher", "anaesthesia", "analyse"), other_spellings = c("organization",
"acknowledgment", "cipher", "anesthesia", "analyze")), row.names = c(NA,
5L), class = "data.frame")
skills.db = structure(list(skills = c("variance analysis static", "analyze kpi",
"financial analysis", "variance analysis", "organizational",
"analysis", "organize", "result analysis", "analytic", "datum analysis",
"analytics", "business analysis", "organized", "quantitative analysis",
"train need analysis", "analytic think", "analysis trial preparation",
"analyze statue", "google analytics", "service analysis", "organize individual",
"account analysis", "analyze department work", "pareto analysis train",
"organization", "ratio analysis", "statistical analysis", "project organization",
"organize client's file", "with good analytic", "nielsen analytics",
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics",
"market analysis", "analyse", "analytic skill", "superb analytic",
"financial statement analysis", "credit analysis", "quick analysis",
"organizational development", "outstanding financial analytic",
"organization design", "organize conference", "business analytics",
"industry analysis", "fs analysis", "analyze", "cash flow analysis",
"investment analysis", "technical analysis bloomberg", "community organize",
"monthly financial analysis", "expense variance analysis", "stock analysis"
), level1 = c("variance analysis static", "analyze kpi", "financial analysis",
"variance analysis", "organizational", "analysis", "organize",
"result analysis", "analytic", "datum analysis", "analytics",
"business analysis", "organized", "quantitative analysis", "train need analysis",
"analytic think", "analysis trial preparation", "analyze statue",
"google analytics", "service analysis", "organize individual",
"account analysis", "analyze department work", "pareto analysis train",
"organization", "ratio analysis", "statistical analysis", "project organization",
"organize client's file", "with good analytic", "nielsen analytics",
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics",
"market analysis", "analyse", "analytic skill", "superb analytic",
"financial statement analysis", "credit analysis", "quick analysis",
"organizational development", "outstanding financial analytic",
"organization design", "organize conference", "business analytics",
"industry analysis", "fs analysis", "analyze", "cash flow analysis",
"investment analysis", "technical analysis bloomberg", "community organize",
"monthly financial analysis", "expense variance analysis", "stock analysis"
)), row.names = c(49L, 65L, 77L, 82L, 155L, 190L, 215L, 244L,
246L, 260L, 287L, 300L, 311L, 323L, 349L, 356L, 378L, 386L, 447L,
607L, 622L, 664L, 686L, 766L, 824L, 832L, 895L, 922L, 928L, 949L,
1020L, 1054L, 1079L, 1080L, 1081L, 1088L, 1146L, 1158L, 1228L,
1248L, 1319L, 1366L, 1385L, 1440L, 1468L, 1475L, 1509L, 1554L,
1584L, 1606L, 1635L, 1658L, 1660L, 1696L, 1760L, 1762L, 1798L
), class = "data.frame")
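library(dplyr)
skills.db$TEST = skills.db$skills  # start from the original text
for(i in seq_len(nrow(spellings))){
  # chain on TEST (not skills) so earlier replacements are not overwritten
  skills.db = skills.db %>% mutate(TEST = gsub(spellings$other_spellings[i], spellings$preferred_spellings[i], TEST))
}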
Here's one method, using Reduce (which could easily be purrr::reduce) to iterate over each of the spellings and correct them.
spellings_list <- asplit(spellings, 1)
skills.db %>%
  mutate(
    TEST = Reduce(function(txt, spl) gsub(spl[2], spl[1], txt), spellings_list, init = skills),
    changed = (skills != TEST)
  )
# skills level1 TEST changed
# 1 variance analysis static variance analysis static variance analysis static FALSE
# 2 analyze kpi analyze kpi analyse kpi TRUE
# 3 financial analysis financial analysis financial analysis FALSE
# 4 variance analysis variance analysis variance analysis FALSE
# 5 organizational organizational organisational TRUE
# 6 analysis analysis analysis FALSE
# 7 organize organize organize FALSE
# 8 result analysis result analysis result analysis FALSE
# 9 analytic analytic analytic FALSE
# 10 datum analysis datum analysis datum analysis FALSE
# 11 analytics analytics analytics FALSE
# 12 business analysis business analysis business analysis FALSE
# 13 organized organized organized FALSE
# 14 quantitative analysis quantitative analysis quantitative analysis FALSE
# 15 train need analysis train need analysis train need analysis FALSE
# 16 analytic think analytic think analytic think FALSE
# 17 analysis trial preparation analysis trial preparation analysis trial preparation FALSE
# 18 analyze statue analyze statue analyse statue TRUE
# 19 google analytics google analytics google analytics FALSE
# 20 service analysis service analysis service analysis FALSE
# 21 organize individual organize individual organize individual FALSE
# 22 account analysis account analysis account analysis FALSE
# 23 analyze department work analyze department work analyse department work TRUE
# 24 pareto analysis train pareto analysis train pareto analysis train FALSE
# 25 organization organization organisation TRUE
# 26 ratio analysis ratio analysis ratio analysis FALSE
# 27 statistical analysis statistical analysis statistical analysis FALSE
# 28 project organization project organization project organisation TRUE
# 29 organize client's file organize client's file organize client's file FALSE
# 30 with good analytic with good analytic with good analytic FALSE
# 31 nielsen analytics nielsen analytics nielsen analytics FALSE
# 32 datum analytics datum analytics datum analytics FALSE
# 33 textual analytics textual analytics textual analytics FALSE
# 34 social analytics social analytics social analytics FALSE
# 35 business intelligence analytics business intelligence analytics business intelligence analytics FALSE
# 36 market analysis market analysis market analysis FALSE
# 37 analyse analyse analyse FALSE
# 38 analytic skill analytic skill analytic skill FALSE
# 39 superb analytic superb analytic superb analytic FALSE
# 40 financial statement analysis financial statement analysis financial statement analysis FALSE
# 41 credit analysis credit analysis credit analysis FALSE
# 42 quick analysis quick analysis quick analysis FALSE
# 43 organizational development organizational development organisational development TRUE
# 44 outstanding financial analytic outstanding financial analytic outstanding financial analytic FALSE
# 45 organization design organization design organisation design TRUE
# 46 organize conference organize conference organize conference FALSE
# 47 business analytics business analytics business analytics FALSE
# 48 industry analysis industry analysis industry analysis FALSE
# 49 fs analysis fs analysis fs analysis FALSE
# 50 analyze analyze analyse TRUE
# 51 cash flow analysis cash flow analysis cash flow analysis FALSE
# 52 investment analysis investment analysis investment analysis FALSE
# 53 technical analysis bloomberg technical analysis bloomberg technical analysis bloomberg FALSE
# 54 community organize community organize community organize FALSE
# 55 monthly financial analysis monthly financial analysis monthly financial analysis FALSE
# 56 expense variance analysis expense variance analysis expense variance analysis FALSE
# 57 stock analysis stock analysis stock analysis FALSE
I added changed merely as a litmus test, assuming you know which of your inputs should be different.
Walkthrough:
Reduce is going to go over the whole column of skills for each of the spelling corrections. The input to one iteration of its function will be the output of the previous iteration, a necessary property so that we preserve the changes.
Unfortunately, we can't easily use Vectorize here, and Reduce typically likes simple 2-argument functions (it isn't easily Map-able), so I break the spellings frame into a list of length-2 vectors:
spellings_list <- asplit(spellings, 1)
spellings_list
# $`1`
# preferred_spellings other_spellings
#      "organisation"  "organization"
# $`2`
# preferred_spellings  other_spellings
#   "acknowledgement" "acknowledgment"
# $`3`
# preferred_spellings other_spellings
#            "cypher"        "cipher"
# $`4`
# preferred_spellings other_spellings
#       "anaesthesia"    "anesthesia"
# $`5`
# preferred_spellings other_spellings
#           "analyse"       "analyze"
This allows us to more easily use gsub(spl[2], spl[1], ...).
The art of Reduce is knowing which argument to use where, and when to use init=. It's an art. When I find myself doubting what is being fed where, I insert a browser() at the beginning of the anon-func and run through a couple of iterations of the reduction.
Suggestion: you might want to sandwich your other_spellings with \\b on either side of its string, to protect against partial-match replacements. For example, your spellings will also replace organizational even though that word is not literally present in the lookup table. While that one might be desired, depending on your larger list there could easily be false positives. (E.g., color/colour and Colorado.)
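As a rough sketch of the \\b suggestion, the snippet below compares plain patterns with word-bounded ones. The lookup table, the input strings, and the helper name correct() are made up for illustration and are not part of the question's data; perl = TRUE is an assumption chosen so that \\b is handled by the PCRE engine.

```r
# Hypothetical two-row lookup table, for illustration only.
lookup <- data.frame(
  preferred = c("colour", "organisation"),
  other     = c("color", "organization"),
  stringsAsFactors = FALSE
)

# Apply every row of `lookup` to `txt`, one replacement after another.
# With whole_word = TRUE each pattern is wrapped in \\b ... \\b so that
# it only matches complete words.
correct <- function(txt, lookup, whole_word = TRUE) {
  pats <- if (whole_word) paste0("\\b", lookup$other, "\\b") else lookup$other
  Reduce(
    function(acc, i) gsub(pats[i], lookup$preferred[i], acc, perl = TRUE),
    seq_along(pats),
    init = txt
  )
}

x <- c("color palette", "colorado springs", "organizational", "project organization")

loose  <- correct(x, lookup, whole_word = FALSE)
strict <- correct(x, lookup, whole_word = TRUE)

print(data.frame(input = x, loose = loose, strict = strict))
```

In the loose column "colorado springs" is rewritten to "colourado springs", while the strict column leaves it alone; the trade-off is that the strict version also stops correcting derived forms such as "organizational".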
(Edited: I originally swapped spl[1] and spl[2] in the gsub. Apparently there's also "logic" in the art of this :-)
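One way to check which argument is fed where, short of stepping through with browser(), is to ask Reduce to keep every intermediate state with accumulate = TRUE. The following is a minimal sketch on a two-row demo table; spellings_demo and txt are illustrative stand-ins, not the data from the question.

```r
# Tiny illustrative inputs (not the full data from the question).
spellings_demo <- data.frame(
  preferred_spellings = c("organisation", "analyse"),
  other_spellings     = c("organization", "analyze"),
  stringsAsFactors = FALSE
)
txt <- c("analyze organization chart", "stock analysis")

# One length-2 character vector per row:
# spl[1] is the preferred spelling, spl[2] is the spelling to replace.
spellings_list <- asplit(spellings_demo, 1)

# accumulate = TRUE returns the state after every step, beginning with init,
# so steps[[1]] is the untouched input and steps[[3]] is the final result.
steps <- Reduce(
  function(acc, spl) gsub(spl[2], spl[1], acc),
  spellings_list,
  init = txt,
  accumulate = TRUE
)

for (s in steps) print(s)
```

The first argument of the function (acc here, txt in the answer) always receives the running result, and the second receives the next element of the list; init supplies the running result for the very first call.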