Replace text using values from lookup table without for loop

Problem description

I'm writing a function for spelling correction. I scraped the spelling variants page from Wikipedia and converted it into a table. I now want to use this as a lookup table (spellings) and replace values in my documents (skills.db). NOTE: the skills data frame below is just an example; ignore the second column. I will be performing the spelling correction much earlier in the process, on resumes. The resumes are large, so I thought I'd share this instead.

I can do this using a for loop as below; however, I'm wondering if there's a better solution:

spellings = structure(list(preferred_spellings = c("organisation", "acknowledgement", 
"cypher", "anaesthesia", "analyse"), other_spellings = c(" organization", 
" acknowledgment", " cipher", " anesthesia", " analyze")), row.names = c(NA, 
5L), class = "data.frame")

skills.db = structure(list(skills = c("variance analysis static", "analyze kpi", 
"financial analysis", "variance analysis", "organizational", 
"analysis", "organize", "result analysis", "analytic", "datum analysis", 
"analytics", "business analysis", "organized", "quantitative analysis", 
"train need analysis", "analytic think", "analysis trial preparation", 
"analyze statue", "google analytics", "service analysis", "organize individual", 
"account analysis", "analyze department work", "pareto analysis train", 
"organization", "ratio analysis", "statistical analysis", "project organization", 
"organize client's file", "with good analytic", "nielsen analytics", 
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics", 
"market analysis", "analyse", "analytic skill", "superb analytic", 
"financial statement analysis", "credit analysis", "quick analysis", 
"organizational development", "outstanding financial analytic", 
"organization design", "organize conference", "business analytics", 
"industry analysis", "fs analysis", "analyze", "cash flow analysis", 
"investment analysis", "technical analysis bloomberg", "community organize", 
"monthly financial analysis", "expense variance analysis", "stock analysis"
), level1 = c("variance analysis static", "analyze kpi", "financial analysis", 
"variance analysis", "organizational", "analysis", "organize", 
"result analysis", "analytic", "datum analysis", "analytics", 
"business analysis", "organized", "quantitative analysis", "train need analysis", 
"analytic think", "analysis trial preparation", "analyze statue", 
"google analytics", "service analysis", "organize individual", 
"account analysis", "analyze department work", "pareto analysis train", 
"organization", "ratio analysis", "statistical analysis", "project organization", 
"organize client's file", "with good analytic", "nielsen analytics", 
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics", 
"market analysis", "analyse", "analytic skill", "superb analytic", 
"financial statement analysis", "credit analysis", "quick analysis", 
"organizational development", "outstanding financial analytic", 
"organization design", "organize conference", "business analytics", 
"industry analysis", "fs analysis", "analyze", "cash flow analysis", 
"investment analysis", "technical analysis bloomberg", "community organize", 
"monthly financial analysis", "expense variance analysis", "stock analysis"
)), row.names = c(49L, 65L, 77L, 82L, 155L, 190L, 215L, 244L, 
246L, 260L, 287L, 300L, 311L, 323L, 349L, 356L, 378L, 386L, 447L, 
607L, 622L, 664L, 686L, 766L, 824L, 832L, 895L, 922L, 928L, 949L, 
1020L, 1054L, 1079L, 1080L, 1081L, 1088L, 1146L, 1158L, 1228L, 
1248L, 1319L, 1366L, 1385L, 1440L, 1468L, 1475L, 1509L, 1554L, 
1584L, 1606L, 1635L, 1658L, 1660L, 1696L, 1760L, 1762L, 1798L
), class = "data.frame")

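library(dplyr)
skills.db = skills.db %>% mutate(TEST = skills)  # start from the original text
for(i in seq_len(nrow(spellings))){
    # apply each correction to TEST (not skills) so that earlier corrections are kept
    skills.db = skills.db %>% mutate(TEST = gsub(spellings$other_spellings[i], spellings$preferred_spellings[i], TEST))
}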

Solution

Here's one method, using Reduce (which could easily be purrr::reduce) to iterate over each of the spellings and correct them.

library(dplyr)
spellings_list <- asplit(spellings, 1)
skills.db %>%
  mutate(TEST = Reduce(function(txt, spl) gsub(spl[2], spl[1], txt), spellings_list, init = skills),
         changed = (skills != TEST))
#                             skills                          level1                            TEST changed
# 1         variance analysis static        variance analysis static        variance analysis static   FALSE
# 2                      analyze kpi                     analyze kpi                     analyse kpi    TRUE
# 3               financial analysis              financial analysis              financial analysis   FALSE
# 4                variance analysis               variance analysis               variance analysis   FALSE
# 5                   organizational                  organizational                  organisational    TRUE
# 6                         analysis                        analysis                        analysis   FALSE
# 7                         organize                        organize                        organize   FALSE
# 8                  result analysis                 result analysis                 result analysis   FALSE
# 9                         analytic                        analytic                        analytic   FALSE
# 10                  datum analysis                  datum analysis                  datum analysis   FALSE
# 11                       analytics                       analytics                       analytics   FALSE
# 12               business analysis               business analysis               business analysis   FALSE
# 13                       organized                       organized                       organized   FALSE
# 14           quantitative analysis           quantitative analysis           quantitative analysis   FALSE
# 15             train need analysis             train need analysis             train need analysis   FALSE
# 16                  analytic think                  analytic think                  analytic think   FALSE
# 17      analysis trial preparation      analysis trial preparation      analysis trial preparation   FALSE
# 18                  analyze statue                  analyze statue                  analyse statue    TRUE
# 19                google analytics                google analytics                google analytics   FALSE
# 20                service analysis                service analysis                service analysis   FALSE
# 21             organize individual             organize individual             organize individual   FALSE
# 22                account analysis                account analysis                account analysis   FALSE
# 23         analyze department work         analyze department work         analyse department work    TRUE
# 24           pareto analysis train           pareto analysis train           pareto analysis train   FALSE
# 25                    organization                    organization                    organisation    TRUE
# 26                  ratio analysis                  ratio analysis                  ratio analysis   FALSE
# 27            statistical analysis            statistical analysis            statistical analysis   FALSE
# 28            project organization            project organization            project organisation    TRUE
# 29          organize client's file          organize client's file          organize client's file   FALSE
# 30              with good analytic              with good analytic              with good analytic   FALSE
# 31               nielsen analytics               nielsen analytics               nielsen analytics   FALSE
# 32                 datum analytics                 datum analytics                 datum analytics   FALSE
# 33               textual analytics               textual analytics               textual analytics   FALSE
# 34                social analytics                social analytics                social analytics   FALSE
# 35 business intelligence analytics business intelligence analytics business intelligence analytics   FALSE
# 36                 market analysis                 market analysis                 market analysis   FALSE
# 37                         analyse                         analyse                         analyse   FALSE
# 38                  analytic skill                  analytic skill                  analytic skill   FALSE
# 39                 superb analytic                 superb analytic                 superb analytic   FALSE
# 40    financial statement analysis    financial statement analysis    financial statement analysis   FALSE
# 41                 credit analysis                 credit analysis                 credit analysis   FALSE
# 42                  quick analysis                  quick analysis                  quick analysis   FALSE
# 43      organizational development      organizational development      organisational development    TRUE
# 44  outstanding financial analytic  outstanding financial analytic  outstanding financial analytic   FALSE
# 45             organization design             organization design             organisation design    TRUE
# 46             organize conference             organize conference             organize conference   FALSE
# 47              business analytics              business analytics              business analytics   FALSE
# 48               industry analysis               industry analysis               industry analysis   FALSE
# 49                     fs analysis                     fs analysis                     fs analysis   FALSE
# 50                         analyze                         analyze                         analyse    TRUE
# 51              cash flow analysis              cash flow analysis              cash flow analysis   FALSE
# 52             investment analysis             investment analysis             investment analysis   FALSE
# 53    technical analysis bloomberg    technical analysis bloomberg    technical analysis bloomberg   FALSE
# 54              community organize              community organize              community organize   FALSE
# 55      monthly financial analysis      monthly financial analysis      monthly financial analysis   FALSE
# 56       expense variance analysis       expense variance analysis       expense variance analysis   FALSE
# 57                  stock analysis                  stock analysis                  stock analysis   FALSE

I added changed merely as a litmus test, assuming you know which of your inputs should be different.
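One thing to watch in the sample data: the other_spellings values in the question's dput output begin with a space (" organization", " analyze"). The output shown above is what you get when the patterns carry no leading space. With the space left in, " analyze" cannot match "analyze kpi" at the start of a string, and "project organization" would become "projectorganisation". Below is a minimal base-R sketch on a small toy subset. The trimming step is my assumption (that the space is scraping residue), not something stated in the question.

```r
# Toy subset of the lookup table, with the leading spaces from the question.
spellings <- data.frame(
  preferred_spellings = c("organisation", "analyse"),
  other_spellings     = c(" organization", " analyze"),
  stringsAsFactors = FALSE
)
skills <- c("analyze kpi", "project organization", "financial analysis")

# Assumption: the leading whitespace is unintended, so strip it first.
spellings$other_spellings <- trimws(spellings$other_spellings)

# Same reduction as in the answer: one row of the lookup table per step.
spellings_list <- asplit(spellings, 1)
TEST <- Reduce(function(txt, spl) gsub(spl[2], spl[1], txt), spellings_list, skills)

print(TEST)
# "analyse kpi" "project organisation" "financial analysis"
```

If the leading space is actually intentional (for example, to match only after another word), skip the trimws() line and expect the different results described above.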

Walkthrough:

  1. Reduce is going to go over the whole column of skills for each of the spellings corrections. The input to one iteration of its function will be the output of the previous iteration, a necessary property so that we preserve the changes.

  2. Unfortunately, we can't easily use Vectorize here, and Reduce typically likes simple 2-argument functions (it isn't easily Map-able), so I break the spellings frame into a list of length-2 vectors:

    spellings_list <- asplit(spellings, 1)
    spellings_list
    # $`1`
    # preferred_spellings     other_spellings 
    #      "organisation"     " organization" 
    # $`2`
    # preferred_spellings     other_spellings 
    #   "acknowledgement"   " acknowledgment" 
    # $`3`
    # preferred_spellings     other_spellings 
    #            "cypher"           " cipher" 
    # $`4`
    # preferred_spellings     other_spellings 
    #       "anaesthesia"       " anesthesia" 
    # $`5`
    # preferred_spellings     other_spellings 
    #           "analyse"          " analyze" 
    

    This allows us to more easily use gsub(spl[2], spl[1], ...), where spl[2] is the other_spellings pattern and spl[1] is the preferred_spellings replacement.

  3. The art of Reduce is knowing which argument to use where, and when to use init=. When I doubt what is being fed where, I insert a browser() at the beginning of the anonymous function and step through a couple of iterations of the reduction.

  4. Suggestion: you might want to sandwich each other_spellings string between \\b on either side, to protect against partial-match replacements. For example, your spellings will also rewrite organizational, even though that word is not literally in the lookup table. While that one might be desired, a larger list could easily produce false positives (e.g., color/colour and Colorado).
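The word-boundary suggestion in item 4 can be sketched as follows. This is an illustration only: correct_spellings and its whole_word argument are hypothetical names, the toy lookup table is made up, and the code assumes the lookup strings contain no regex metacharacters.

```r
# Toy lookup table (hypothetical values chosen to show partial matches).
spellings <- data.frame(
  preferred_spellings = c("colour", "organisation"),
  other_spellings     = c("color", "organization"),
  stringsAsFactors = FALSE
)
skills <- c("color theory", "colorado office",
            "organization design", "organizational development")

# Hypothetical helper: apply every correction in turn,
# optionally restricting each pattern to whole words via \\b.
correct_spellings <- function(txt, lookup, whole_word = TRUE) {
  patterns <- lookup$other_spellings
  if (whole_word) patterns <- paste0("\\b", patterns, "\\b")
  Reduce(function(acc, i) gsub(patterns[i], lookup$preferred_spellings[i], acc),
         seq_along(patterns), txt)
}

partial <- correct_spellings(skills, spellings, whole_word = FALSE)
whole   <- correct_spellings(skills, spellings, whole_word = TRUE)

print(partial)  # "colorado" becomes "colourado"; "organizational" is rewritten
print(whole)    # "colorado" and "organizational" are left alone
```

Whether organizational should be rewritten is a judgment call. With whole-word matching, derived forms need their own rows in the lookup table.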

(Edited: I originally swapped spl[1] and spl[2] in the gsub. Apparently there's also "logic" in the art of this :-)
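As a lighter-weight alternative to browser() for checking what is fed where, Reduce can return every intermediate state via accumulate = TRUE. The sketch below uses a small made-up lookup table and also confirms that the reduction is equivalent to nesting the gsub calls by hand, which is the "output of one iteration is the input to the next" property from item 1.

```r
# Toy data (hypothetical; two corrections, two strings).
spellings <- data.frame(
  preferred_spellings = c("organisation", "analyse"),
  other_spellings     = c("organization", "analyze"),
  stringsAsFactors = FALSE
)
skills <- c("analyze organization", "analysis")

spellings_list <- asplit(spellings, 1)

# accumulate = TRUE keeps the starting value and the result after each step.
steps <- Reduce(function(txt, spl) gsub(spl[2], spl[1], txt),
                spellings_list, skills, accumulate = TRUE)

print(steps[[1]])  # the init value, untouched
print(steps[[2]])  # after the first row:  organization -> organisation
print(steps[[3]])  # after the second row: analyze -> analyse

# The final state equals the hand-nested calls, applied in table order.
unrolled <- gsub("analyze", "analyse",
                 gsub("organization", "organisation", skills))
stopifnot(identical(steps[[length(steps)]], unrolled))
```

The list has one more element than the lookup table has rows, because the first element is the init value.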
