R dplyr filter based on matching a search term with the first letters of any word in selected columns


Problem description

I'm trying to filter rows based on keywords that begin a word in the text of selected columns, using a regular expression. Here, I want to pick every row in which a word starts with "bio" or "15". The problem is that the search terms can also occur in the middle of a word, such as "symbiotic" in the Name column and 161540 in the Code column.

**Name**                     **Code**
Biofuel is good          159403
Bioecological is good    161540
Probiotics is good       159883
Good is symbiotic        1877447

I tried the code below:

Innov_filter <- Innov_Data %>% 
  select(everything()) %>% 
  filter(str_detect(str_to_lower(Name), "bio") | str_detect(str_to_lower(Code), "bio"))

However, this does not work: it also keeps the last row, which doesn't meet either condition. I would appreciate help with a strict search that matches the search term only at the start of a word, not at any position within a word.

Thanks.

Recommended answer

EDIT

If we want to select rows where any word starts with "bio", we can put a word-boundary anchor in front of the pattern:

# "\\b" is a regex word boundary; the backslash must be doubled inside an R string
df %>%
  filter(str_detect(str_to_lower(Name), "\\bbio") | str_detect(Code, "^15"))

Or the same thing in base R:

# Split each name on whitespace ("\\s+") and test whether any word starts with "bio"
df[sapply(strsplit(df$Name, "\\s+"), function(x) any(grepl("^bio", tolower(x)))) | 
                                                 grepl("^15", df$Code), ]
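To check this on the sample data from the question, here is a minimal reproducible sketch. It assumes the data frame is called `df` as in the answer, that `Code` is stored as a number (the question does not say), and it wraps `Code` in `as.character()` so the conversion is explicit. The names `loose` and `strict` are only illustrative.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question; Code is assumed to be numeric.
df <- data.frame(
  Name = c("Biofuel is good", "Bioecological is good",
           "Probiotics is good", "Good is symbiotic"),
  Code = c(159403, 161540, 159883, 1877447),
  stringsAsFactors = FALSE
)

# Unanchored pattern: "bio" may sit anywhere in the text,
# so "Probiotics" and "symbiotic" match as well.
loose <- df %>%
  filter(str_detect(str_to_lower(Name), "bio"))

# "\\bbio" requires "bio" to start a word; "^15" requires the code to start with 15.
strict <- df %>%
  filter(str_detect(str_to_lower(Name), "\\bbio") |
           str_detect(as.character(Code), "^15"))

print(strict)
```

On this data the strict filter drops only the "Good is symbiotic" row. "Probiotics is good" is still kept, but through its Code (159883), not its Name.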


Original answer

This selects rows where "bio" appears in the first word of Name (word(Name) returns only the first word), or where Code starts with "15".

library(tidyverse)
df %>%
  filter(str_detect(str_to_lower(word(Name)), "bio") | str_detect(Code, "^15"))


#                   Name   Code
#1       Biofuel is good 159403
#2 Bioecological is good 161540
#3    Probiotics is good 159883


Using the same logic in base R, we can do:

# "\\s+" (doubled backslash) splits on runs of whitespace; x[1] is the first word
df[sapply(strsplit(df$Name, "\\s+"), function(x) grepl("bio", tolower(x[1]))) 
                                  | grepl("^15", df$Code), ]

#                   Name   Code
#1       Biofuel is good 159403
#2 Bioecological is good 161540
#3    Probiotics is good 159883

Here, strsplit splits each string on whitespace, x[1] extracts the first word, and grepl checks whether that word contains "bio". Rows whose Code starts with "15" are kept as well.
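The two approaches in this answer do not behave the same way, and a small base R sketch may make the difference easier to see. The names in `names_vec` are hypothetical, not taken from the question, and were chosen so that each check gives a different result.

```r
# Hypothetical names (not from the question) chosen to separate the two checks.
names_vec <- c("Biofuel is good", "Probiotics is good", "Good biofuel here")

# First word of each name: split on whitespace and take element 1.
first_words <- sapply(strsplit(names_vec, "\\s+"), function(x) x[1])

# Original-answer logic: "bio" anywhere inside the first word.
first_word_contains <- grepl("bio", tolower(first_words))

# Edited-answer logic: "bio" at the start of any word in the name.
any_word_starts <- grepl("\\bbio", tolower(names_vec), perl = TRUE)

print(data.frame(names_vec, first_word_contains, any_word_starts))
```

The first-word check accepts "Probiotics" and misses "Good biofuel here", while the word-boundary check does the opposite. The word-boundary check is the one that matches the strict search asked for in the question.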
