Filtering observations in dplyr in combination with grepl


Question

I am trying to work out how to filter some observations from a large dataset using dplyr and grepl. I am not wedded to grepl if other solutions would work better.

Take this example df:

df1 <- data.frame(fruit=c("apple", "orange", "xapple", "xorange", 
                          "applexx", "orangexx", "banxana", "appxxle"), group=c("A", "B") )
df1


#     fruit group
#1    apple     A
#2   orange     B
#3   xapple     A
#4  xorange     B
#5  applexx     A
#6 orangexx     B
#7  banxana     A
#8  appxxle     B

I would like to:

  1. filter out those cases beginning with 'x'
  2. filter out those cases ending with 'xx'

I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not only those cases beginning or ending with it. Here is how to get rid of everything with 'xx' inside (not just ending with):

df1 %>%  filter(!grepl("xx",fruit))

#    fruit group
#1   apple     A
#2  orange     B
#3  xapple     A
#4 xorange     B
#5 banxana     A

This obviously 'erroneously' (from my point of view) filtered 'appxxle'.

I have never fully got to grips with regular expressions. I've been trying to modify code such as grepl("^(?!x).*$", df1$fruit, perl = TRUE) to make it work within the filter command, but am not quite getting it.

Desired output:

#      fruit group
#1     apple     A
#2    orange     B
#3   banxana     A
#4   appxxle     B

I'd like to do this inside dplyr if possible.

Recommended answer

I didn't understand your second regex, but this more basic regex seems to do the trick:

df1 %>% filter(!grepl("^x|xx$", fruit))
###
    fruit group
1   apple     A
2  orange     B
3 banxana     A
4 appxxle     B
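
For what it's worth, the lookahead from the question can also be made to work inside filter(). The following is only a sketch, not part of the original answer: it assumes dplyr is installed, refers to the column as fruit rather than df1$fruit, and adds a lookbehind, (?<!xx), which is my own extension to cover the 'ends with xx' case. Lookarounds need perl = TRUE.

```r
library(dplyr)

df1 <- data.frame(
  fruit = c("apple", "orange", "xapple", "xorange",
            "applexx", "orangexx", "banxana", "appxxle"),
  group = c("A", "B"),
  stringsAsFactors = FALSE
)

# ^(?!x)    : the string must not start with "x"
# (?<!xx)$  : the string must not end with "xx"
# Here grepl() marks the rows to KEEP, so it is not negated with "!".
res <- df1 %>%
  filter(grepl("^(?!x).*(?<!xx)$", fruit, perl = TRUE))

print(res)
```

This keeps the same four rows as the simpler pattern above. The negated "^x|xx$" version is easier to read, so the lookaround form is mainly useful for seeing why the original attempt was close.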

And I assume you know this, but you don't have to use dplyr here at all:

df1[!grepl("^x|xx$", df1$fruit), ]
###
    fruit group
1   apple     A
2  orange     B
7 banxana     A
8 appxxle     B

The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and end of the string respectively. | is the OR operator. We're negating the result of grepl with ! so we keep the strings that don't match the regex.
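
To make the explanation concrete, here is a small base R sketch that checks each alternative of the pattern separately on the example fruit values. The variable names are just for illustration. It also compares the result with startsWith() and endsWith(), base R functions (R 3.3.0 and later) that avoid regular expressions entirely, in case grepl isn't a requirement.

```r
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

# Each alternative of "^x|xx$" on its own
starts_x <- grepl("^x", fruit)   # begins with "x"
ends_xx  <- grepl("xx$", fruit)  # ends with "xx"

# The alternation is TRUE wherever either anchored pattern matches
drop <- grepl("^x|xx$", fruit)
stopifnot(identical(drop, starts_x | ends_xx))

# Regex-free equivalent using fixed-string prefix/suffix tests
drop_fixed <- startsWith(fruit, "x") | endsWith(fruit, "xx")
stopifnot(identical(drop, drop_fixed))

# Negate to keep only the strings that match neither condition
kept <- fruit[!drop]
print(kept)
```

The same logical expression should work inside filter() as well, for example filter(!(startsWith(fruit, "x") | endsWith(fruit, "xx"))), provided fruit is a character column rather than a factor.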
