Using filter inside filter in dplyr gives unexpected results


Problem description

Using R 3.1.2 and dplyr 0.4.0.

I'm trying to use a filter within a filter, which sounds very simple, and I don't understand why it doesn't give me the result I expect. This is code I wrote about 6 months ago and I'm fairly certain it worked, so I assume it stopped working because of an updated version of R, dplyr, or some other dependency. Anyway, here is some simple code that filters rows from df1 based on a condition that is found with a filter on a column in df2.

df1 <- data.frame(x = c("A", "B"), stringsAsFactors = FALSE)
df2 <- data.frame(x = "A", y = TRUE, stringsAsFactors = FALSE)
dplyr::filter(df1, x %in% (dplyr::filter(df2, y)$x))

I expect this to show the first row of df1, but instead I get

# [1] x
# <0 rows> (or 0-length row.names)

which I'm not sure what to make of. Why is it returning a vector AND an empty data.frame?

If I break up the filter code into two separate statements, I get what I expect

xval <- dplyr::filter(df2, y)$x
dplyr::filter(df1, x %in% xval)

#   x
# 1 A

Can anyone help me figure out why this behaviour is happening? I'm not saying it's a bug, but I don't understand it.

Recommended answer

It's a valid question why your approach doesn't work (any more, apparently). I can't answer that, but I would suggest a different approach, as commented above, that avoids nested function calls (filter inside another filter). That, IMO, is what dplyr is made for: expressing operations in syntax that is easy to read and understand, from left to right and top to bottom.

So for your example, because the columns you are interested in are both named "x" you can do:

filter(df2, y) %>% select(x) %>% inner_join(df1)




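  • filter the data of df2 by column y

  • select only the x column

  • perform an inner_join with df1 on the common column (x). An inner_join returns all rows from x where there are matching values in y, and all columns from x and y (here x and y denote the first and second table passed to the join, not the columns of df1 and df2).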

And if they were different, for example "z" and "x", you could use:

# assumes the key column is named z in df2 and x in df1
filter(df2, y) %>% select(z) %>% inner_join(df1, by = c("z" = "x"))
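For anyone who wants to run the join approach end to end, here is a minimal sketch, assuming dplyr is installed and attached. df2_z is a hypothetical copy of df2 whose key column is named z; it is not part of the question and is only there to illustrate the by argument.

```r
library(dplyr)

df1 <- data.frame(x = c("A", "B"), stringsAsFactors = FALSE)
df2 <- data.frame(x = "A", y = TRUE, stringsAsFactors = FALSE)

# Same key name in both tables; naming the key explicitly avoids the
# "Joining by" message that dplyr prints when `by` is left out.
same_name <- filter(df2, y) %>% select(x) %>% inner_join(df1, by = "x")

# Hypothetical variant: the key is called z in df2_z and x in df1.
df2_z <- data.frame(z = "A", y = TRUE, stringsAsFactors = FALSE)
diff_name <- filter(df2_z, y) %>%
  select(z) %>%
  inner_join(df1, by = c("z" = "x"))

print(same_name)
print(diff_name)
```

In the second call the result keeps the key name of the left-hand table (z), because the left-hand side of by = c("z" = "x") refers to the table that is piped in.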
    






As noted by Hadley in his comment below, it would be safer to use a semi_join instead of inner_join here. The documentation says:


semi_join: return all rows from x where there are matching values in y, keeping just columns from x.

A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
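To make the difference concrete, here is a small sketch, assuming dplyr is attached. df2_dup is a hypothetical table, not from the question, in which the value "A" passes the filter twice.

```r
library(dplyr)

df1 <- data.frame(x = c("A", "B"), stringsAsFactors = FALSE)

# Hypothetical data: "A" appears twice among the rows where y is TRUE.
df2_dup <- data.frame(x = c("A", "A"), y = c(TRUE, TRUE),
                      stringsAsFactors = FALSE)
keys <- filter(df2_dup, y) %>% select(x)

# inner_join returns one row of df1 per matching row in keys.
inner_rows <- nrow(inner_join(df1, keys, by = "x"))

# semi_join only checks whether a match exists, so df1 rows are not repeated.
semi_rows <- nrow(semi_join(df1, keys, by = "x"))

print(c(inner = inner_rows, semi = semi_rows))
```

With this data the inner join yields two rows for "A" and the semi join yields one, which is why the semi join is the safer choice for pure filtering.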

Hence, you could do for the example case:

filter(df2, y) %>% select(x) %>% semi_join(df1)
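One thing to keep in mind: semi_join keeps the rows and columns of its first argument, which in the pipeline above is the filtered df2. In this example both tables only share the single column x, so the output looks the same either way. If the goal is to keep the rows of df1 (for instance when df1 carries more columns), a sketch with df1 on the left might look like this. df1_wide is a hypothetical version of df1 with an extra column n, added only for illustration.

```r
library(dplyr)

# Hypothetical df1 with an extra column that should survive the filtering.
df1_wide <- data.frame(x = c("A", "B"), n = c(1, 2),
                       stringsAsFactors = FALSE)
df2 <- data.frame(x = "A", y = TRUE, stringsAsFactors = FALSE)

# Keep the rows of df1_wide whose x has a match in the filtered df2.
res <- semi_join(df1_wide, filter(df2, y), by = "x")
print(res)
```

This returns only the "A" row of df1_wide, with both of its columns and nothing from df2.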
    
