R dplyr filter based on matching a search term with the start of any word in selected columns
Problem description
I'm trying to filter rows on selected columns, keeping those where a keyword starts a word in the text, i.e. matches a particular regular expression. Here, I'm trying to pick all words starting with "bio" or "15". But the search terms can also occur in the middle of some words, such as "symbiotic" in the Name column and 161540 in the Code column.
Name                     Code
Biofuel is good          159403
Bioecological is good    161540
Probiotics is good       159883
Good is symbiotic        1877447
I tried the code below:
Innov_filter <- Innov_Data %>%
  select(everything()) %>%
  filter(str_detect(str_to_lower(Name), "bio") | str_detect(str_to_lower(Code), "bio"))
This does not work, however, because it also returns the last row, which meets neither condition. I would appreciate help with a strict search that matches the search term only at the start of a word, not at any position within it.
Thanks
Recommended answer
EDIT
If we want to select rows where any word in Name starts with "bio", or where Code starts with "15", we can do
# "\\b" is a regex word boundary; the backslash must be doubled inside an R string
df %>%
  filter(str_detect(str_to_lower(Name), "\\bbio") | str_detect(Code, "^15"))
OR the same thing in base R
# Split each Name on whitespace and keep rows where any word starts with "bio"
df[sapply(strsplit(df$Name, "\\s+"), function(x) any(grepl("^bio", tolower(x)))) |
     grepl("^15", df$Code), ]
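To try the word-start idea on the data from the question, here is a minimal sketch in base R only. The data frame is rebuilt by hand from the table in the question; the names `name_hit`, `code_hit`, and `result` are illustrative and not part of the original answer. It uses `perl = TRUE` so that `\\b` is interpreted as a PCRE word boundary.

```r
# Sample data reconstructed from the question's table (assumed to be character/numeric columns).
df <- data.frame(
  Name = c("Biofuel is good", "Bioecological is good",
           "Probiotics is good", "Good is symbiotic"),
  Code = c(159403, 161540, 159883, 1877447),
  stringsAsFactors = FALSE
)

# "\\bbio" matches "bio" only where a word begins, so "Probiotics" and
# "symbiotic" do not count as hits on the Name column.
name_hit <- grepl("\\bbio", df$Name, ignore.case = TRUE, perl = TRUE)

# "^15" anchors the match to the first characters of the code.
code_hit <- grepl("^15", as.character(df$Code))

# Keep rows that satisfy either condition.
result <- df[name_hit | code_hit, ]
print(result)
```

The "Probiotics is good" row is still kept here, but only because its Code starts with "15", not because of its Name.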
Original answer
This selects rows where "bio" is present in the first word of Name (word(Name) returns only the first word) or where Code starts with "15".
library(tidyverse)
df %>%
  filter(str_detect(str_to_lower(word(Name)), "bio") | str_detect(Code, "^15"))
#                    Name   Code
#1        Biofuel is good 159403
#2  Bioecological is good 161540
#3     Probiotics is good 159883
Using the same logic but in base R, we can do
# "\\s+" needs the doubled backslash; a single "\s" is an invalid escape in an R string
df[sapply(strsplit(df$Name, "\\s+"), function(x) grepl("bio", tolower(x[1]))) |
     grepl("^15", df$Code), ]
#                    Name   Code
#1        Biofuel is good 159403
#2  Bioecological is good 161540
#3     Probiotics is good 159883
Here, strsplit splits each string on whitespace, x[1] extracts the first word of each, and grepl checks whether that word contains "bio". The second condition keeps rows whose Code starts with "15".
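The difference between the matching strategies is easier to see side by side. The sketch below compares three patterns on the names from the question plus one extra string, "Good biofuel here", which is a hypothetical value added only to separate the last two cases. The variable names are illustrative.

```r
# Names from the question, plus one hypothetical string ("Good biofuel here")
# where "bio" starts a word that is not the first word.
names_vec <- c("Biofuel is good", "Bioecological is good",
               "Probiotics is good", "Good is symbiotic",
               "Good biofuel here")

# Unanchored: "bio" anywhere in the string.
anywhere <- grepl("bio", names_vec, ignore.case = TRUE)

# "^bio": the whole string (and so its first word) must start with "bio".
string_start <- grepl("^bio", names_vec, ignore.case = TRUE)

# "\\bbio": any word in the string may start with "bio".
any_word_start <- grepl("\\bbio", names_vec, ignore.case = TRUE, perl = TRUE)

print(data.frame(name = names_vec, anywhere, string_start, any_word_start))
```

Only the unanchored pattern matches "Probiotics is good" and "Good is symbiotic", which is the behaviour the question wanted to avoid. Note that the original answer checks whether the first word contains "bio", which is why it also matches "Probiotics".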