Filter where there are at least two pattern matches


Problem description

I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I want to subset the table so it shows text that matches at least two of the patterns.

This is further complicated by the fact that some of the patterns already are an either/or, for example something like "paul|john".

I think I either want an expression that would mean directly to subset on that basis, or alternatively if I could count the number of times the patterns occur I could then use that as a tool to subset. I've seen ways to count the number of times patterns occur but not where the info is clearly linked to the IDs in the original dataset, if that makes sense.

At the moment the best I can think of would be to add a column to the data.table for each pattern, check if each pattern matches individually, then filter on the sum of the patterns. This seems quite convoluted so I am hoping there is a better way, as there are quite a lot of patterns to check!

Example data

library(data.table)

text_table <- data.table(ID = 1:5, text = c("lucy, sarah and paul live on the same street",
                                            "lucy has only moved here recently",
                                            "lucy and sarah are cousins",
                                            "john is also new to the area",
                                            "paul and john have known each other a long time"))
text_patterns <- c("lucy", "sarah", "paul|john")

With the example data, I would want IDs 1 and 3 in the subsetted data.

Thanks for your help!

Recommended answer

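We can collapse 'text_patterns' into a single pattern with |, pass it to 'str_count' to count the matching substrings in each row, and keep the rows of the data.table where the count is greater than 1. Note that this counts total matches rather than distinct patterns, so ID 5, which matches only "paul|john" (twice), is returned as well.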

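library(data.table)
library(stringr)  # provides str_count()
# Count all matches of the combined pattern; keep rows with more than one
text_table[str_count(text, paste(text_patterns, collapse = "|")) > 1]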
#    ID                                            text
#1:  1    lucy, sarah and paul live on the same street
#2:  3                      lucy and sarah are cousins
#3:  5 paul and john have known each other a long time
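To see why ID 5 appears in the result above, it can help to look at the per-row match counts directly. The following is a minimal base-R sketch (no packages) that reproduces the counting step with `gregexpr` and `regmatches` instead of `str_count`; the names `texts`, `combined` and `total_matches` are chosen for this illustration only and are not part of the original answer.

```r
# Same example sentences as in the question, held in a plain character vector
texts <- c("lucy, sarah and paul live on the same street",
           "lucy has only moved here recently",
           "lucy and sarah are cousins",
           "john is also new to the area",
           "paul and john have known each other a long time")
text_patterns <- c("lucy", "sarah", "paul|john")

# Collapse all patterns into one alternation: "lucy|sarah|paul|john"
combined <- paste(text_patterns, collapse = "|")

# Count every match of the combined pattern in each string
total_matches <- lengths(regmatches(texts, gregexpr(combined, texts)))

# Rows 1, 3 and 5 have more than one match; row 5 gets there via "paul" and "john"
which(total_matches > 1)
```

Because "paul" and "john" each add to the total, a row can pass the `> 1` test while matching only one of the three original patterns.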



Update

If each element of 'text_patterns' should count as a single pattern (so that "paul|john" contributes at most one match per row), we loop over the patterns, check whether each one is present (str_detect), add the resulting logical vectors together with +, and compare the sum with 1 to create the logical vector for subsetting rows

# One logical vector per pattern, summed element-wise; TRUE where 2+ patterns match
i1 <- text_table[, Reduce(`+`, lapply(text_patterns, 
       function(x) str_detect(text, x))) >1]
text_table[i1]
#    ID                                         text
#1:  1 lucy, sarah and paul live on the same street
#2:  3                   lucy and sarah are cousins
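The question also asks for counts that stay linked to the IDs. A possible way to get that, sketched here in base R under the assumption that a plain data.frame is acceptable, is to build one logical column per pattern with `grepl` and sum across columns. The names `ids`, `texts`, `hits` and `pattern_counts` are hypothetical and chosen for this example.

```r
ids <- 1:5
texts <- c("lucy, sarah and paul live on the same street",
           "lucy has only moved here recently",
           "lucy and sarah are cousins",
           "john is also new to the area",
           "paul and john have known each other a long time")
text_patterns <- c("lucy", "sarah", "paul|john")

# Logical matrix: one row per text, one column per pattern
hits <- vapply(text_patterns,
               function(p) grepl(p, texts),
               logical(length(texts)))

# Number of distinct patterns matched in each row, kept next to the ID
pattern_counts <- data.frame(ID = ids, n_patterns = rowSums(hits))

# IDs whose text matches at least two of the patterns
pattern_counts$ID[pattern_counts$n_patterns >= 2]
```

With a data.table, the same `rowSums(hits) >= 2` vector could be used in `i` to subset the rows, and it scales to many patterns without adding a column for each one by hand.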
