Looping grepl() through data.table (R)

Views: 216  Published: 2017/3/12 11:39:37  Tags: regex r data.table data-cleaning


Problem description

I have a dataset stored as a data.table DT that looks like this:
print(DT)
   category            industry
1: administration      admin
2: nurse practitioner  truck
3: trucking            truck
4: administration      admin
5: warehousing         nurse
6: warehousing         admin
7: trucking            truck
8: nurse practitioner  nurse         
9: nurse practitioner  truck 
I would like to reduce the table to only the rows where the industry matches the category. My general approach is to use grepl() to match each row of DT$category against the regex '^{{IND}}[a-z ]+$', with the corresponding row of DT$industry inserted in place of {{IND}} using infuse(). I struggled to find a sleek data.table solution that would properly loop through the table and make within-row comparisons, so I resorted to a for-loop to get the job done:
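library(data.table)
library(infuser)  # infuse() is assumed to come from the infuser package

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
    ind <- DT[i]$industry
    categ <- DT[i]$category
    # Fill this row's industry into the template, then test the category against it
    if (grepl(infuse(IND = ind, template), categ)) {
        DT[i, match := TRUE]
    }
}
DT <- DT[match == TRUE]
print(DT)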
   category            industry
1: administration      admin
2: trucking            truck
3: administration      admin
4: trucking            truck
5: nurse practitioner  nurse
However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.
Solution
Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
This uses the current idiom for subsetting by group (see http://stackoverflow.com/a/16574176/1191259), thanks to @eddi.
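To make the grouped idiom concrete, here is a minimal, self-contained sketch that applies it to the sample data from the question. The data.table is retyped by hand from the printed output, and the `sort()` call is an addition of this sketch (not part of the answer), included only so the rows come back in their original order rather than grouped by industry.

```r
library(data.table)

# Sample data retyped from the table printed in the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Within each by-group, `industry` is a single string, so it can be used as
# the grep() pattern. grep() returns positions within the group, and .I maps
# those positions back to row numbers of the full table.
idx <- DT[, .I[grep(industry, category)], by = industry]$V1

# idx is ordered by group; sorting restores the original row order
# (an addition in this sketch, not part of the answer's one-liner).
result <- DT[sort(idx)]
print(result)
```

Because the pattern is evaluated once per industry rather than once per row, the number of regex calls scales with the number of distinct industries, which is where the grouped approach is expected to pay off.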



Comments. These might help further:


If you have many rows with the same industry-category combo, try by=.(industry,category).
Try something else in the place of grep (like the options in Ken and Richard's answers).
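Both comments can be sketched on the question's sample data. Two caveats apply. First, when grouping by both columns, this sketch switches from grep() to grepl(): each group then holds a single category string, so grep() would return position 1 and `.I[1]` would keep only the first row of a matching group, whereas a logical TRUE keeps them all. Second, the answers by Ken and Richard are not reproduced here, so the base R `startsWith()` call below is only one possible replacement for grep, chosen for this illustration. It does literal prefix matching and does not enforce the `[a-z ]+$` tail of the original template.

```r
library(data.table)

# Sample data retyped from the table printed in the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Variant 1: group on the (industry, category) pair, so the regex is
# evaluated once per distinct combination. grepl() yields a single
# TRUE/FALSE per group; .I[TRUE] keeps every row of a matching group.
keep_by_pair <- DT[, .I[grepl(industry, category)],
                   by = .(industry, category)]$V1
result_pair <- DT[sort(keep_by_pair)]

# Variant 2 (an assumption of this sketch, not taken from the other
# answers): startsWith() is vectorised over both arguments, so no grouping
# or looping is needed for a plain prefix check.
result_prefix <- DT[startsWith(category, industry)]

print(result_pair)
print(result_prefix)
```

On this sample both variants select the same five rows; on real data they can differ whenever an industry value contains regex metacharacters, since only the first variant treats it as a pattern.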

