Why is the same query returning different results on different R sessions using dplyr?

Question

While working on a project with a colleague that involved using the dplyr package from the tidyverse to manipulate a data frame, I noticed that some of our results were different even though we were using the same code and the same data.

Session info from both R sessions:

Desktop:

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 
[2] LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
 [4] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
 [7] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0
[10] sp_1.3-2      

RStudio Cloud

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] randomNames_1.4-0.0  plotly_4.9.2.1       lubridate_1.7.9     
 [4] openintro_2.0.0      usdata_0.1.0         cherryblossom_0.1.0 
 [7] airports_0.1.0       leaflet_2.0.3        forcats_0.5.0       
[10] stringr_1.4.0        dplyr_1.0.0          purrr_0.3.4         
[13] readr_1.3.1          tidyr_1.1.0          tibble_3.0.2        
[16] ggplot2_3.3.2        tidyverse_1.3.0      shinydashboard_0.7.1
[19] shiny_1.5.0         

Reproducible example using iris:


library(tidyverse)

#lets say that each flower on the data frame iris had a name


iris$name <-c("Jackson","al-Jalali","Tamblyn","Beckham","Knipp","Chen","el-Hares","al-Shaheen","Boyd","Gurung","Demolli","el-Omer","Christensen","Ayele","Wilson","Arriaga","el-Vaziri","Aragon","Demoudt","Gray","Raiburn","al-Aziz","Phouthavong","John","Bortolutti","Ellis","Williams","Gonzalez","Valenzuela","Smith","el-Ishak","al-Tabet","Perez","Watson","el-Imam","Kerr","Morales-Gonzale","Bell","Haines","Gutierrez","SalcidoIbarra","Jimenez","al-Bari","Gosnell","Kocsis","Pratt","Tenorio","Merriweather","Damiana","al-Jafari","Edwards","Mujkic","Lam","Russell","Christy","el-Zahra","al-Lodi","Murry","Haro","Chu","Espinoza","Sahnd","Sands","el-Nagi","Dickerson","Carlton","Flood","Tran","Cruz","Yu","West","Franklin","Dupree","Delger","White","Olivero","Sem","al-Muhammed","Shafer","Senette","Hudson","Lattimer","Lyons","Grim","Grove","Truong","LynnGoin","el-Hassan","Cline","Adams","Watkins","Littlejohn","Gatzke","Vandyke","Yocum","Ng","Ortiz","Schwartz","Torres","Hernandez","Krien","Thyfault","al-Ansari","el-Shahin","el-Hashemi","Hereford","Navajo","Bickel","Saiganesh","Polson","Bates","Griffith","Krueger","Yang","AlAmin","Linthicum","Gallegos","Murphy","Johnson","Basurto","Rendon","el-Minhas","Khan","al-Ebrahim","Macgilvray","Farrell","Ricord","Lovato","Sanchez","Palmer","Turner","al-Fares","Ball","Ji","OrtizMorales","Fan","Isaac","Barger","Eddins","Fabrizio","Hedin","Brodsky","Leggett","Le","Guichard","al-Rahim","Benefiel","Sullivan","Milender","Smith")
  

#and that for some reason the same flower can appear more than once in the data frame
sample_index<-c(14,50,118,43,14,118,90,91,91,92,137,99,72,26,
7,137,78,81,43,103,117,76,143,32,109,7,137,74,
23,53,135,53,34,69,72,76,63,141,97,91,38,21,
41,90,60,16,116,94,6,86,86,39,118,50,34,4,
13,69,127,52,22,89,25,35,112,30,140,121,110,64,
142,67,122,79,85,136,51,74,106,98,74,127,17,46,
54,110,94,79,24,113,107,135,102,135,5,70,16,24,
32,21)

iris_big <- rbind(iris,iris[sample_index,])

I was trying to find out how many unique flowers of each species there were, so I wrote the following query:

 
iris_big %>% 
  group_by(name,Species) %>% 
  count() %>% 
  ungroup() %>% 
  count(Species)

The problem is that it returns two different results: one on my desktop and another on my friend's machine (he was using RStudio Cloud).

My desktop:

# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

RStudio Cloud:


Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        83
2 versicolor    80
3 virginica     87

I eventually fixed this issue by using the following query:

iris_big %>% 
  group_by(name,Species) %>% 
  count() %>% 
  ungroup() %>%
  select(Species) %>% 
  group_by(Species) %>% 
  count()

# A tibble: 3 x 2
# Groups:   Species [3]
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

But I would like to know why this is happening.

Recommended answer

(Up front: I'm submitting this as an alternate answer, since my first answer (about the change in sample.int between R-3.5 and R-3.6) still seems relevant to the question of "Why is the same query returning different results on different R sessions". It is not what is causing this symptom, but it very easily could have been, since the first version of your question used sample. Instead, the real culprit here is an equally "major" version change in dplyr.)

You are seeing a change in the behavior of dplyr::count.

In dplyr-0.8.3, ?count says:

      wt: (Optional) If omitted (and no variable named 'n' exists in
          the data), will count the number of rows. If specified, will
          perform a "weighted" tally by summing the (non-missing)
          values of variable 'wt'. A column named 'n' (but not 'nn' or
          'nnn') will be used as weighting variable by default in
          'tally()', but not in 'count()'. This argument is
          automatically quoted and later evaluated in the context of
          the data frame. It supports unquoting. See
          'vignette("programming")' for an introduction to these
          concepts.

In dplyr-1.0.0:

      wt: <'data-masking'> Frequency weights. Can be a variable (or
          combination of variables) or 'NULL'. 'wt' is computed once
          for each unique combination of the counted variables.

            • If a variable, 'count()' will compute 'sum(wt)' for each
              unique combination.

            • If 'NULL', the default, the computation depends on
              whether a column of frequency counts 'n' exists in the
              data frame. If it exists, the counts are computed with
              'sum(n)' for each unique combination. Otherwise, 'n()' is
              used to compute the counts. Supply 'wt = n()' to force
              this behaviour even if you have an 'n' column in the data
              frame.

The important part is that the 0.8.3 documentation says a "column named 'n' ... will be used as weighting variable by default in 'tally()', but not in 'count()'". The 1.0.0 documentation drops that wording: if a column named n exists, count() now computes sum(n) by default. In your pipeline the first count() creates exactly such a column, so the second count(Species) sums it instead of counting rows. I reproduced your results using R-3.5.3/dplyr-0.8.3 and R-4.0.2/dplyr-1.0.0.
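To make the two computations concrete, here is a minimal sketch on toy data (the data frame per_flower and the column names row_count and weighted_sum are made up for illustration). It does not call count() at all; it simply spells out, with summarise(), the n() and sum(n) computations that the quoted documentation describes, so it should behave the same on either dplyr version.

```r
library(dplyr)

# Toy stand-in for the output of the first count(): one row per
# (name, Species) pair, with n holding how often that pair occurred.
per_flower <- data.frame(
  Species = c("setosa", "setosa", "virginica"),
  n       = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)

both <- per_flower %>%
  group_by(Species) %>%
  summarise(
    row_count    = n(),     # number of rows per species (unweighted count)
    weighted_sum = sum(n)   # sum of the existing n column (weighted count)
  )

print(both)
```

Here row_count corresponds to the 50/50/50 result in the question, and weighted_sum corresponds to the 83/80/87 result.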

There are two ways around it:


  1. Use count(..., wt = n()):

R.version$version.string
# [1] "R version 3.5.3 (2019-03-11)"
iris_big %>%
  group_by(name,Species) %>%
  count() %>%
  ungroup() %>%
  count(Species, wt = n())
# # A tibble: 3 x 2
#   Species        n
#   <fct>      <int>
# 1 setosa        50
# 2 versicolor    50
# 3 virginica     50


R.version$version.string
# [1] "R version 4.0.2 (2020-06-22)"
iris_big %>%
  group_by(name,Species) %>%
  count() %>%
  ungroup() %>%
  count(Species, wt = n())
# # A tibble: 3 x 2
#   Species        n
#   <fct>      <int>
# 1 setosa        50
# 2 versicolor    50
# 3 virginica     50



  2. Switch to tally() within a grouping, as in:

    iris_big %>%
      group_by(name,Species) %>%
      count() %>%
      group_by(Species) %>%
      tally()
    



Or you can go with another option:

  1. Realize that this is issue dplyr#5298, which is fixed in the yet-to-be-released dplyr-1.0.1 (I do not know the timeline). The RStudio Cloud user can therefore opt for the GitHub version of dplyr to benefit from dplyr#5349, a PR that has already been merged. This should revert count's behavior to what it was before 1.0.0 (despite Hadley's opinion on the matter).
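As a further alternative that is not part of the options above, one could sidestep the wt default entirely by never creating an n column before the final count. The sketch below assumes distinct() is acceptable for "one row per unique flower"; the toy data frame flowers and the result name unique_per_species are made up for illustration.

```r
library(dplyr)

# Toy data: three distinct flowers, two of which appear more than once.
flowers <- data.frame(
  name    = c("Jackson", "Jackson", "Chen", "Boyd", "Boyd", "Boyd"),
  Species = c("setosa", "setosa", "setosa",
              "virginica", "virginica", "virginica"),
  stringsAsFactors = FALSE
)

# Keep one row per (name, Species) pair, then count rows per species.
# No column named n exists when count() runs, so the weighting default
# discussed above never comes into play.
unique_per_species <- flowers %>%
  distinct(name, Species) %>%
  count(Species)

print(unique_per_species)
```

Applied to the question's data, the same two steps would be iris_big %>% distinct(name, Species) %>% count(Species).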
