Looping grepl() through data.table (R)
Problem description
I have a dataset stored as a data.table DT that looks like this:
print(DT)
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck
I would like to reduce the table to only the rows where the industry matches the category. My general approach is to use grepl() to match each row of DT$category against the regex '^{{INDUSTRY}}[a-z ]+$', with the corresponding row of DT$industry inserted in place of {{INDUSTRY}} using infuse(). I struggled to find a sleek data.table solution that would loop through the table and make within-row comparisons, so I resorted to a for loop to get the job done:
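library(data.table)
library(infuser)  # provides infuse()

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
  ind <- DT[i]$industry
  categ <- DT[i]$category
  # fill this row's industry into the template, then test the category against it
  if (grepl(infuse(template, IND = ind), categ)) {
    DT[i]$match <- TRUE
  }
}
DT <- DT[match == TRUE]
print(DT)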
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse
However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.
Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
This uses the current idiom for subsetting by group (http://stackoverflow.com/a/16574176/1191259), thanks to @eddi.
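Here is a minimal, self-contained sketch of that idiom, assuming the nine-row table from the question. The names `keep` and `result` are introduced only for illustration. Within each group, `industry` is a single string, so it can be passed to grep() as the pattern, and `.I` maps the matching positions back to row numbers in the full table.

```r
library(data.table)

# Rebuild the example table from the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# For each industry, collect the row numbers whose category matches it
keep <- DT[, .I[grep(industry, category)], by = industry]$V1

# The row numbers come back ordered by group; sort() restores table order
result <- DT[sort(keep)]
print(result)
```

Note that the pattern here is unanchored, so an industry appearing anywhere in the category counts as a match. To mirror the question's regex more closely, one option would be `grep(paste0("^", industry), category)`.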
Comments. These might help further:
- If you have many rows with the same industry-category combo, try by=.(industry,category).
- Try something else in the place of grep (like the options in Ken and Richard's answers).
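The two comments could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the other answers. In the pair-grouped version, grepl() replaces grep() because each group then holds one distinct category string: grep() would return position 1 and select only the first row of the group, whereas a single TRUE recycles across all of the group's rows. For the second comment, base R's startsWith() (R 3.3.0 or later) is swapped in as a non-regex prefix test. The names `idx`, `by_pair` and `by_prefix` are hypothetical.

```r
library(data.table)

# Example table from the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Comment 1: test each distinct (industry, category) pair only once
idx <- DT[, .I[grepl(industry, category)], by = .(industry, category)]$V1
by_pair <- DT[sort(idx)]

# Comment 2: skip the regex and compare the two columns element by element
by_prefix <- DT[startsWith(category, industry)]

print(by_pair)
print(by_prefix)
```

On this small table both variants keep the same five rows. startsWith() only checks for a literal prefix, so it would also keep a row whose category equals the industry exactly, which the question's regex (requiring at least one further character) would not.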