Filtering observations in dplyr in combination with grepl


Question



    I am trying to work out how to filter some observations from a large dataset using dplyr and grepl. I am not wedded to grepl if another solution would be better.

    Take this sample df:

    library(dplyr)  # provides filter() and %>% used below

    df1 <- data.frame(fruit=c("apple", "orange", "xapple", "xorange", 
                              "applexx", "orangexx", "banxana", "appxxle"), group=c("A", "B"))
    df1
    
    
    #     fruit group
    #1    apple     A
    #2   orange     B
    #3   xapple     A
    #4  xorange     B
    #5  applexx     A
    #6 orangexx     B
    #7  banxana     A
    #8  appxxle     B
    

    I want to:

    1. filter out those cases beginning with 'x'
    2. filter out those cases ending with 'xx'

    I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not beginning with or ending with. Here is how to get rid of everything with 'xx' inside (not just ending with):

    df1 %>%  filter(!grepl("xx",fruit))
    
    #    fruit group
    #1   apple     A
    #2  orange     B
    #3  xapple     A
    #4 xorange     B
    #5 banxana     A
    

    From my point of view this is wrong, because it also filters out 'appxxle', which contains 'xx' but does not end with it.

    I have never fully got to grips with regular expressions. I've been trying to modify code such as: grepl("^(?!x).*$", df1$fruit, perl = TRUE) to try and make it work within the filter command, but am not quite getting it.

    Expected output:

    #      fruit group
    #1     apple     A
    #2    orange     B
    #3   banxana     A
    #4   appxxle     B
    

    I'd like to do this inside dplyr if possible.

    Solution

    I didn't understand your second regex, but this more basic regex seems to do the trick:

    df1 %>% filter(!grepl("^x|xx$", fruit))
    ###
        fruit group
    1   apple     A
    2  orange     B
    3 banxana     A
    4 appxxle     B
    

    And I assume you know this, but you don't have to use dplyr here at all:

    df1[!grepl("^x|xx$", df1$fruit), ]
    ###
        fruit group
    1   apple     A
    2  orange     B
    7 banxana     A
    8 appxxle     B
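
One visible difference between the two outputs above is the row labels: the dplyr result is renumbered 1 to 4, while base subsetting keeps the original labels 1, 2, 7, 8. A minimal base R sketch of that behaviour, assuming the same sample data as in the question (the names `kept` and `original_labels` are illustrative only):

```r
# Sample data from the question; group is recycled to length 8
df1 <- data.frame(
  fruit = c("apple", "orange", "xapple", "xorange",
            "applexx", "orangexx", "banxana", "appxxle"),
  group = c("A", "B"),
  stringsAsFactors = FALSE
)

# Base subsetting: keep rows whose fruit neither starts with x nor ends with xx
kept <- df1[!grepl("^x|xx$", df1$fruit), ]

# The surviving rows carry their original row labels
original_labels <- rownames(kept)

# Resetting the row names renumbers the rows from 1
rownames(kept) <- NULL
kept
```

Whether to reset the labels is a matter of taste; keeping them can be handy for tracing rows back to the original data frame.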
    

    The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and ending of the string respectively. | is the OR operator. We're negating the results of grepl with the ! so we're finding strings that don't match what's inside the regex.
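
To make the explanation above concrete, here is a small base R sketch that tests each half of the pattern separately and then compares the result with two alternatives. It is only an illustration under stated assumptions: the variable names are made up for this example, `startsWith()` and `endsWith()` are base R functions that match literal text rather than a regex, and the lookaround pattern is one possible way to complete the `perl = TRUE` attempt from the question, not necessarily what the asker had in mind.

```r
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

# Each alternative of the pattern on its own
starts_with_x <- grepl("^x", fruit)     # ^ anchors the match to the start
ends_with_xx  <- grepl("xx$", fruit)    # $ anchors the match to the end

# The combined pattern: | means "either alternative matches"
combined <- grepl("^x|xx$", fruit)

# Regex-free alternative using literal prefix and suffix tests
literal <- startsWith(fruit, "x") | endsWith(fruit, "xx")

# Lookaround version (needs perl = TRUE): TRUE for strings to KEEP,
# i.e. not starting with x and not ending with xx
keep_lookaround <- grepl("^(?!x).*(?<!xx)$", fruit, perl = TRUE)

# All three approaches select the same strings
fruit[!combined]
fruit[!literal]
fruit[keep_lookaround]
```

Inside dplyr, the literal version could be written as `df1 %>% filter(!startsWith(fruit, "x"), !endsWith(fruit, "xx"))`, since `filter()` combines comma-separated conditions with AND. Note that `startsWith()` and `endsWith()` expect character vectors, so a factor column would need `as.character()` first.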
