Why is the same query returning different results on different R sessions using dplyr?
Problem description
While working on a project with a colleague that involved using the dplyr package from the tidyverse to manipulate a data frame, I noticed that some of our results were different even though we were using the same code and the same data.
Session info from both R sessions:
Desktop:
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252
[2] LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
[4] purrr_0.3.3 readr_1.3.1 tidyr_1.0.0
[7] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[10] sp_1.3-2
RStudio Cloud:
> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomNames_1.4-0.0 plotly_4.9.2.1 lubridate_1.7.9
[4] openintro_2.0.0 usdata_0.1.0 cherryblossom_0.1.0
[7] airports_0.1.0 leaflet_2.0.3 forcats_0.5.0
[10] stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4
[13] readr_1.3.1 tidyr_1.1.0 tibble_3.0.2
[16] ggplot2_3.3.2 tidyverse_1.3.0 shinydashboard_0.7.1
[19] shiny_1.5.0
Reproducible example using iris:
library(tidyverse)
# let's say that each flower in the iris data frame had a name
iris$name <-c("Jackson","al-Jalali","Tamblyn","Beckham","Knipp","Chen","el-Hares","al-Shaheen","Boyd","Gurung","Demolli","el-Omer","Christensen","Ayele","Wilson","Arriaga","el-Vaziri","Aragon","Demoudt","Gray","Raiburn","al-Aziz","Phouthavong","John","Bortolutti","Ellis","Williams","Gonzalez","Valenzuela","Smith","el-Ishak","al-Tabet","Perez","Watson","el-Imam","Kerr","Morales-Gonzale","Bell","Haines","Gutierrez","SalcidoIbarra","Jimenez","al-Bari","Gosnell","Kocsis","Pratt","Tenorio","Merriweather","Damiana","al-Jafari","Edwards","Mujkic","Lam","Russell","Christy","el-Zahra","al-Lodi","Murry","Haro","Chu","Espinoza","Sahnd","Sands","el-Nagi","Dickerson","Carlton","Flood","Tran","Cruz","Yu","West","Franklin","Dupree","Delger","White","Olivero","Sem","al-Muhammed","Shafer","Senette","Hudson","Lattimer","Lyons","Grim","Grove","Truong","LynnGoin","el-Hassan","Cline","Adams","Watkins","Littlejohn","Gatzke","Vandyke","Yocum","Ng","Ortiz","Schwartz","Torres","Hernandez","Krien","Thyfault","al-Ansari","el-Shahin","el-Hashemi","Hereford","Navajo","Bickel","Saiganesh","Polson","Bates","Griffith","Krueger","Yang","AlAmin","Linthicum","Gallegos","Murphy","Johnson","Basurto","Rendon","el-Minhas","Khan","al-Ebrahim","Macgilvray","Farrell","Ricord","Lovato","Sanchez","Palmer","Turner","al-Fares","Ball","Ji","OrtizMorales","Fan","Isaac","Barger","Eddins","Fabrizio","Hedin","Brodsky","Leggett","Le","Guichard","al-Rahim","Benefiel","Sullivan","Milender","Smith")
#and that for some reason the same flower can appear more than once in the data frame
sample_index<-c(14,50,118,43,14,118,90,91,91,92,137,99,72,26,
7,137,78,81,43,103,117,76,143,32,109,7,137,74,
23,53,135,53,34,69,72,76,63,141,97,91,38,21,
41,90,60,16,116,94,6,86,86,39,118,50,34,4,
13,69,127,52,22,89,25,35,112,30,140,121,110,64,
142,67,122,79,85,136,51,74,106,98,74,127,17,46,
54,110,94,79,24,113,107,135,102,135,5,70,16,24,
32,21)
iris_big <- rbind(iris,iris[sample_index,])
I wanted to know how many unique flowers of each species there were, so I wrote the following query:
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
count(Species)
The problem is that it returns two different results: one on my desktop and another on my friend's machine (he was using RStudio Cloud).
My desktop:
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
RStudio Cloud:
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 83
2 versicolor 80
3 virginica 87
I eventually fixed this issue by using the following query:
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
select(Species) %>%
group_by(Species) %>%
count()
# A tibble: 3 x 2
# Groups: Species [3]
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
But I would like to know why this is happening.
Recommended answer
(Up front, I'm submitting this as an alternate answer since my first answer (about the change in sample.int between R-3.5 and R-3.6) still seems relevant to the question of "Why is the same query returning different results on different R sessions"; it is not what is causing this symptom, but it very easily could have been, since the first version of your question used sample. Instead, the real culprit here is an equally "major" version change in dplyr.)
You are hitting a change in the behavior of dplyr::count.
In dplyr-0.8.3, ?count says:
wt: (Optional) If omitted (and no variable named 'n' exists in
the data), will count the number of rows. If specified, will
perform a "weighted" tally by summing the (non-missing)
values of variable 'wt'. A column named 'n' (but not 'nn' or
'nnn') will be used as weighting variable by default in
'tally()', but not in 'count()'. This argument is
automatically quoted and later evaluated in the context of
the data frame. It supports unquoting. See
'vignette("programming")' for an introduction to these
concepts.
In dplyr-1.0.0:
wt: <'data-masking'> Frequency weights. Can be a variable (or
combination of variables) or 'NULL'. 'wt' is computed once
for each unique combination of the counted variables.
• If a variable, 'count()' will compute 'sum(wt)' for each
unique combination.
• If 'NULL', the default, the computation depends on
whether a column of frequency counts 'n' exists in the
data frame. If it exists, the counts are computed with
'sum(n)' for each unique combination. Otherwise, 'n()' is
used to compute the counts. Supply 'wt = n()' to force
this behaviour even if you have an 'n' column in the data
frame.
The important part is that in 0.8.3 it says a "column named 'n' ... will be used ... in 'tally()', but not in 'count()'". In 1.0.0 that wording is gone: with the default wt = NULL, count() computes sum(n) whenever a column named n exists. In your query the first count() creates exactly such a column, so the second count() sums those per-flower counts instead of counting rows. I reproduced your results using R-3.5.3/dplyr-0.8.3 and R-4.0.2/dplyr-1.0.0.
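To make the two interpretations concrete, here is a minimal sketch that computes both explicitly with summarise(), so the result does not depend on which default count() applies. The data frame per_flower and its values are made up for illustration; it stands in for the output of the first count() in the question.

```r
library(dplyr)

# Hypothetical stand-in for the result of the first count():
# one row per (name, Species) pair, with n = number of times that flower appears.
per_flower <- data.frame(
  name    = c("A", "B", "C", "D"),
  Species = c("setosa", "setosa", "virginica", "virginica"),
  n       = c(2L, 1L, 3L, 1L)
)

both <- per_flower %>%
  group_by(Species) %>%
  summarise(
    rows     = n(),     # number of rows per species: the count of unique flowers
    weighted = sum(n)   # sum of the existing n column: the total number of rows in the original data
  )

print(both)
```

The rows column corresponds to the desktop result in the question (unique flowers), and the weighted column corresponds to the RStudio Cloud result (all rows, duplicates included).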
There are two ways around it:
- Use count(..., wt = n()):
R.version$version.string
# [1] "R version 3.5.3 (2019-03-11)"
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
count(Species, wt = n())
# # A tibble: 3 x 2
# Species n
# <fct> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
R.version$version.string
# [1] "R version 4.0.2 (2020-06-22)"
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
count(Species, wt = n())
# # A tibble: 3 x 2
# Species n
# <fct> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
- Or convert to using tally() within a grouping, as in:
iris_big %>%
group_by(name,Species) %>%
count() %>%
group_by(Species) %>%
tally()
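A further possibility, not part of the original answer, is to avoid creating an n column in the first place: drop duplicate rows with distinct() and then count. This is only a sketch on a toy data frame; flowers and its values are made up as a small stand-in for iris_big.

```r
library(dplyr)

# Hypothetical stand-in for iris_big: four distinct flowers, some of them repeated.
flowers <- data.frame(
  name    = c("A", "A", "B", "C", "C", "C", "D"),
  Species = c("setosa", "setosa", "setosa",
              "virginica", "virginica", "virginica", "virginica")
)

unique_per_species <- flowers %>%
  distinct(name, Species) %>%  # keep one row per (name, Species) pair
  count(Species)               # no n column exists yet, so this counts rows

print(unique_per_species)
```

Because no column named n is present when count() runs, the question of whether it is used as a weight never arises.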
Or you can go with another option:
- Realize that this is issue dplyr#5298, which is fixed in the yet-to-be-released dplyr-1.0.1 (I do not know a timeline). With that, the RStudio Cloud user can opt for the GitHub version of dplyr to benefit from dplyr#5349, a PR that has already been merged. This should revert count's behavior back to the pre-1.0.0 behavior (despite Hadley's opinion on the matter).
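When two sessions disagree like this, it can help to check which dplyr version each one is running. The snippet below is a small sketch; the variable names are made up, and the check against "1.0.0" only reflects the versions discussed in this answer.

```r
# Report the installed dplyr version in this session.
dplyr_version <- packageVersion("dplyr")

# TRUE only for the release in which, per the help text quoted above,
# count() sums an existing n column by default.
is_1_0_0 <- dplyr_version == "1.0.0"

message("dplyr ", as.character(dplyr_version),
        if (is_1_0_0) ": count() weights by an existing n column by default" else "")

# To try the development version from GitHub (assumes the remotes package is installed):
# remotes::install_github("tidyverse/dplyr")
```

Running this in both sessions shows immediately whether they are on different sides of the 1.0.0 change.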