Looping grepl() through data.table (R)

Views: 17 | Published: 2022/1/13 19:28:37 | Tags: regex, r, data.table, data-cleaning

Problem description

I have a dataset stored as a data.table DT that looks like this:

print(DT)
   category            industry
1: administration      admin
2: nurse practitioner  truck
3: trucking            truck
4: administration      admin
5: warehousing         nurse
6: warehousing         admin
7: trucking            truck
8: nurse practitioner  nurse         
9: nurse practitioner  truck

I would like to reduce the table to only rows where the industry matches the category. My general approach is to use grepl() to regex match the string '^{{INDUSTRY}}[a-z ]+$' and each row of DT$category, with each corresponding row of DT$industry inserted in place of {{INDUSTRY}} in the regex string using infuse(). I struggled to find a sleek data.table solution that would properly loop through the table and make within-row comparisons, so I resorted to a for-loop to get the job done:
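library(data.table)
library(infuser)  # assumption: infuse() comes from the infuser package

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
    ind <- DT$industry[i]
    categ <- DT$category[i]
    # Fill this row's industry into the template, then test this row's category
    if (grepl(infuse(IND = ind, template), categ)) {
        set(DT, i, "match", TRUE)  # update row i by reference
    }
}
DT <- DT[match == TRUE]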
print(DT)
   category            industry
1: administration      admin
2: trucking            truck
3: administration      admin
4: trucking            truck
5: nurse practitioner  nurse

However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.
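
To make the per-row pattern described in the question concrete, here is a minimal base R sketch of what the template expands to for a single industry. It uses sprintf() with a %s placeholder in place of infuse() and {{IND}}; that substitution is an assumption made only so the snippet runs without extra packages.

```r
# Stand-in for the infuse() template: %s marks where the industry goes
template <- "^%s[a-z ]+$"

# Build the pattern for one industry value
pattern <- sprintf(template, "admin")

# Test it against one matching and one non-matching category
matches <- grepl(pattern, c("administration", "warehousing"))

print(pattern)
print(matches)
```

Because of the `+` quantifier, this pattern requires at least one further lowercase letter or space after the industry, so a category identical to the industry would not match.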

Recommended answer

Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
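
The one-liner above can be unpacked into two steps. The following is a sketch that rebuilds the sample table from the question so it can be run on its own; the names keep and res are introduced here only for illustration.

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Inner call: within each industry group, `industry` is a single value, so
# grep() returns the within-group positions of the matching categories and
# .I maps them back to row numbers of DT. The unnamed result column is V1.
keep <- DT[, .I[grep(industry, category)], by = industry]$V1

# Outer call: subset DT by those row numbers
res <- DT[keep]
print(res)
```

Two things differ from the loop in the question: the rows come back grouped by industry rather than in their original order (DT[sort(keep)] would restore it), and grep(industry, category) matches the industry anywhere in the category. If the anchored template is needed, the pattern could presumably be built inside the call, for example with paste0("^", industry, "[a-z ]+$").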

This uses the current idiom for subsetting by group, thanks to @eddi.

Comments. These might help further:

If you have many rows with the same industry-category combination, try by=.(industry,category).
Try something else in the place of grep (like the options in Ken and Richard's answers).
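
Both suggestions can be sketched as follows. This is one possible reading, not necessarily what the answer or the other answers had in mind. When grouping by both columns, each group holds a single category value, so the sketch tests the pair once with grepl() and returns .I only for matching pairs. The second variant swaps the regex for base R's startsWith(), which is an assumption about what "something else" could be.

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Variant 1: one regex test per distinct industry-category pair.
# Groups where the test fails return NULL and are dropped from the result.
keep <- DT[, if (grepl(industry, category)) .I,
           by = .(industry, category)]$V1
res_pairs <- DT[sort(keep)]  # sort() restores the original row order

# Variant 2: no regex and no grouping; startsWith() compares each
# category with the industry in the same row.
res_prefix <- DT[startsWith(category, industry)]

print(res_pairs)
print(res_prefix)
```

Note that startsWith() also accepts a category that is exactly equal to its industry, which the question's template (with its `+` quantifier) would reject.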
