Filter where there are at least two pattern matches
Question
I have a lot of text data in a data.table and several text patterns that I'm interested in. I want to subset the table so that it shows only the text that matches at least two of the patterns.
This is further complicated by the fact that some of the patterns are already an either/or, for example "paul|john".
I think I want either an expression that subsets directly on that basis or, alternatively, a way to count how many of the patterns occur in each row, which I could then use to subset. I've seen ways to count the number of times patterns occur, but none where the counts stay clearly linked to the IDs in the original dataset.
At the moment the best I can think of is to add a column to the data.table for each pattern, check whether each pattern matches individually, then filter on the sum of those columns. This seems quite convoluted, so I am hoping there is a better way, as there are quite a lot of patterns to check!
Example data
text_table <- data.table(ID = 1:5,
                         text = c("lucy, sarah and paul live on the same street",
                                  "lucy has only moved here recently",
                                  "lucy and sarah are cousins",
                                  "john is also new to the area",
                                  "paul and john have known each other a long time"))
text_patterns <- c("lucy", "sarah", "paul|john")
With the example data, I would want IDs 1 and 3 in the subsetted data.
Thanks for your help!
Recommended answer
We can paste the elements of text_patterns together with | as the separator, use the result as the pattern in str_count (from stringr) to count the matching substrings in each row, and keep the rows where that count is greater than 1. Note that this counts every matched substring, so a row that matches the single pattern "paul|john" twice (ID 5) is kept as well.
library(data.table)
library(stringr)  # provides str_count() and str_detect()
text_table[str_count(text, paste(text_patterns, collapse = "|")) > 1]
# ID text
#1: 1 lucy, sarah and paul live on the same street
#2: 3 lucy and sarah are cousins
#3: 5 paul and john have known each other a long time
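To see why ID 5 shows up here, it can help to look at the per-row counts directly. The following is a minimal sketch that uses only base R, with gregexpr and regmatches standing in for str_count (a substitution made here so the snippet runs without extra packages); the variable names combined and n_matches are illustrative and not part of the original answer.

```r
# Same example sentences as in the question, kept as a plain character vector.
text <- c("lucy, sarah and paul live on the same street",
          "lucy has only moved here recently",
          "lucy and sarah are cousins",
          "john is also new to the area",
          "paul and john have known each other a long time")
text_patterns <- c("lucy", "sarah", "paul|john")

# One combined regex: "lucy|sarah|paul|john"
combined <- paste(text_patterns, collapse = "|")

# Count every matched substring per row (base R stand-in for str_count).
n_matches <- lengths(regmatches(text, gregexpr(combined, text)))
n_matches
# 3 1 2 1 2

# Rows with more than one matched substring: row 5 qualifies because
# "paul" and "john" are counted separately once the patterns are pasted.
which(n_matches > 1)
# 1 3 5
```

The combined pattern no longer knows that "paul|john" was a single entry in text_patterns, which is why this approach counts substrings rather than distinct patterns.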
Update
If each element of text_patterns should instead count as a single pattern, so that "paul|john" contributes at most one match per row, we loop over the patterns, check whether each one is present (str_detect), add the resulting logical vectors together with Reduce and +, and compare the sum with 1 to create the logical vector for subsetting rows.
i1 <- text_table[, Reduce(`+`, lapply(text_patterns,
function(x) str_detect(text, x))) >1]
text_table[i1]
# ID text
#1: 1 lucy, sarah and paul live on the same street
#2: 3 lucy and sarah are cousins
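The question also asked for the counts to stay linked to the IDs. A minimal sketch of one way to do that is shown below, assuming base R only: grepl is used in place of str_detect, and a plain data.frame stands in for the data.table so the snippet runs without extra packages. The names ids, hits, n_patterns and result are hypothetical and chosen for illustration.

```r
ids <- 1:5
text <- c("lucy, sarah and paul live on the same street",
          "lucy has only moved here recently",
          "lucy and sarah are cousins",
          "john is also new to the area",
          "paul and john have known each other a long time")
text_patterns <- c("lucy", "sarah", "paul|john")

# Logical matrix: one row per text, one column per pattern.
# Each pattern is tested once per row, so "paul|john" counts at most once.
hits <- sapply(text_patterns, function(p) grepl(p, text))

# Number of distinct patterns matched in each row.
n_patterns <- rowSums(hits)

# Keep the count next to the ID, then filter on it.
result <- data.frame(ID = ids, n_patterns = n_patterns)
result[result$n_patterns >= 2, ]
```

Because hits has one column per pattern, the approach should scale to a long list of patterns without adding a column to the table for each one. In a data.table, the same rowSums expression could probably be assigned to a new column with := and then used in the filter, though that variant is not tested here.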