R aggregate gives differently structured results using subsets from the same data
Problem description
I'm making diurnal cycles of wind speed based on a data frame (ball) of several years' hourly data. I want to plot them by season, so I subset out the dates I need and join them like this:
b8 = subset(ball, as.Date(date)>="2008-09-01 00:00:00, GMT" & as.Date(date)<= "2008-11-30 23:00:00, GMT" )
b9 = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT" )
b10 = subset(ball, as.Date(date)>="2010-09-01 00:00:00, GMT" & as.Date(date)<= "2010-11-30 23:00:00, GMT")
ballspr = rbind(b8,b9,b10)
I then get a diurnal cycle using this:
sprwsdiurnal <- aggregate(ballspr["ws"], format(ballspr["date"],"%H"),summary, na.rm=T)
For three out of four seasons this makes an object with this structure:
date ws
1 00 0.200, 1.000, 1.600, 2.021, 2.500, 8.000, 5.000
2 01 0.100, 1.000, 1.600, 1.988, 2.500, 8.600, 1.000
3 02 0.100, 1.000, 1.700, 1.982, 2.600, 8.900, 1.000
...through to 24 hours...
23 22 0.100, 1.200, 1.800, 2.222, 2.950, 9.100, 1.000
24 23 0.100, 1.000, 1.600, 2.072, 2.700, 8.800, 1.000
This is the structure I want, because boxplot works with it:
par( mar = c(5, 5, 2, 2))
boxplot(sprwsdiurnal$ws, col="dodger blue",pch=16,font.lab=2,cex.lab=1.5,cex.axis=2,xlab="Hour",range=0, ylab=quote(Windspeed ~ "(" * m ~ s ^-1 * ")"),xaxt="n",main="Spring")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5, font.lab=2)
The trouble is one season comes out like this:
date ws.Min. ws.1st Qu. ws.Median ws.Mean ws.3rd Qu. ws.Max. ws.NA's
1 00 0.000 1.300 2.100 2.539 3.200 10.500 2.000
2 01 0.100 1.275 2.100 2.499 3.200 9.800 2.000
3 02 0.200 1.200 2.000 2.514 3.400 9.000 2.000
...through to 24 hours...
23 22 0.100 1.200 1.950 2.582 3.325 11.900 2.000
24 23 0.100 1.300 2.000 2.585 3.400 11.200 2.000
Boxplot does not work with this format. I can't explain why this happens, when all the code for each season is the same and they are being subsetted from the same dataframe. Why does one come out differently? Any ideas appreciated.
Edit: here's the data. I've checked these two seasons and they still give the two different formats shown above.
https://www.dropbox.com/s/v5kss0bgjyhrtw1/ball.csv
ball=read.csv("ball.csv", header=T)
ball$date = as.POSIXct(strptime(ball$date, format = "%Y-%m-%d %H:%M:%S", "GMT"))
win9 = subset(ball, as.Date(date)>="2009-06-01 00:00:00, GMT" & as.Date(date)<= "2009-08-31 23:00:00, GMT" )
aut9 = subset(ball, as.Date(date)>="2009-03-01 00:00:00, GMT" & as.Date(date)<= "2009-05-31 23:00:00, GMT" )
spr9 = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT" )
sum9 = subset(ball, as.Date(date)>="2008-12-01 00:00:00, GMT" & as.Date(date)<= "2009-02-28 23:00:00, GMT" )
sprdiurnal <- aggregate(spr9["ws"], format(spr9["date"],"%H"),summary, na.rm=T)
par( mar = c(5, 5, 4, 2))
boxplot(sprdiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5)
windiurnal <- aggregate(win9["ws"], format(win9["date"],"%H"),summary, na.rm=T)
par( mar = c(5, 5, 4, 2))
boxplot(windiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5)
Recommended answer
The "problem", so far as I can tell, is that every hourly result of summary in your aggregate call for "sprdiurnal" has the same length, so the results form a rectangular dataset that R stores as a matrix. For your other subsets, some hours include NA and others don't. Because summary adds an NA's entry only when values are missing, those results differ in length, the dataset is not rectangular, and R stores the summary as a list.
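The length difference behind this can be checked directly. A minimal sketch, using made-up numbers rather than the wind data:

```r
# Two small vectors with illustrative (made-up) values.
no_na   <- summary(c(1.2, 0.8, 2.5, 3.1))
with_na <- summary(c(1.2, NA, 2.5, 3.1))

length(no_na)    # six statistics: Min. through Max.
length(with_na)  # seven: the same six plus an NA's count
names(with_na)
```

When every group has the same length, aggregate can bind the results into a matrix; when the lengths are mixed, it cannot.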
I'll demonstrate with the "iris" dataset, but first I'll create an "iris_2" dataset that has one NA value.
iris_2 <- iris
iris_2$Sepal.Length[10] <- NA
Let's compare the aggregation output, looking only at the second column in each case. The "iris" dataset, which has no missing values, returns a rectangular matrix as the second "column" of the data.frame. The "iris_2" dataset, because of its one NA value, gets stored as a list instead, which is what you want for your particular purpose.
(irisagg <- aggregate(iris["Sepal.Length"], iris["Species"], summary))[[2]]
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# [1,] 4.3 4.800 5.0 5.006 5.2 5.8
# [2,] 4.9 5.600 5.9 5.936 6.3 7.0
# [3,] 4.9 6.225 6.5 6.588 6.9 7.9
(iris_2agg <- aggregate(iris_2["Sepal.Length"], iris_2["Species"], summary))[[2]]
# $`0`
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 4.300 4.800 5.000 5.008 5.200 5.800 1
#
# $`1`
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.900 5.600 5.900 5.936 6.300 7.000
#
# $`2`
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.900 6.225 6.500 6.588 6.900 7.900
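The difference matters for boxplot because it groups a matrix by column but a list by element. A small sketch of that, assuming the same "iris" and "iris_2" setup as above (the names m and l are only for illustration); plot = FALSE returns the box statistics without drawing anything:

```r
iris_2 <- iris
iris_2$Sepal.Length[10] <- NA

m <- aggregate(iris["Sepal.Length"], iris["Species"], summary)[[2]]
l <- aggregate(iris_2["Sepal.Length"], iris_2["Species"], summary)[[2]]

is.matrix(m)  # all three summaries have six entries
is.list(l)    # one summary has an extra NA's entry

# $n holds one count per box that would be drawn.
length(boxplot(m, plot = FALSE)$n)  # one box per matrix column (statistic)
length(boxplot(l, plot = FALSE)$n)  # one box per list element (species)
```

So the matrix form still plots, but it gives one box per summary statistic rather than one per group.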
Here's how we would convert the matrix into a list, with one element per row.
irisagg$Summary <- unlist(apply(irisagg[[2]], 1, list), recursive = FALSE)
irisagg$Summary
# [[1]]
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.300 4.800 5.000 5.006 5.200 5.800
#
# [[2]]
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.900 5.600 5.900 5.936 6.300 7.000
#
# [[3]]
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.900 6.225 6.500 6.588 6.900 7.900
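Two alternatives that are not part of the answer above, offered only as a sketch: an explicit lapply over the rows, which may read more easily than the unlist(apply(...)) call, and the use.cols argument of boxplot's matrix method, which groups by row without converting anything. The names m, rows and by_row are illustrative.

```r
m <- aggregate(iris["Sepal.Length"], iris["Species"], summary)[[2]]

# One named vector per matrix row, collected in a plain list.
rows <- lapply(seq_len(nrow(m)), function(i) m[i, ])

# Ask boxplot's matrix method to treat rows, not columns, as groups.
# plot = FALSE returns the statistics without drawing.
by_row <- boxplot(m, use.cols = FALSE, plot = FALSE)

length(rows)      # one element per species
length(by_row$n)  # one box per row
```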
Of course, a much more direct approach would be to use the simplify argument of aggregate:
(iris_3agg <- aggregate(iris["Sepal.Length"],
iris["Species"], summary,
simplify = FALSE))[[2]]
# $`0`
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.300 4.800 5.000 5.006 5.200 5.800
#
# $`1`
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.900 5.600 5.900 5.936 6.300 7.000
#
# $`2`
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.900 6.225 6.500 6.588 6.900 7.900
Applying this to your example, "sprdiurnal" is the subset that's giving you trouble. View sprdiurnal$ws by itself and verify that it's a matrix. Let's convert it to a list.
sprdiurnal$ws2 <- unlist(apply(sprdiurnal$ws, 1, list), recursive=FALSE)
Now you can proceed with boxplot as you were doing with the other seasons.
boxplot(sprdiurnal$ws2)  # add the same graphical arguments used for the other seasons
Or, remake your sprdiurnal object using:
sprdiurnal <- aggregate(spr9["ws"],
format(spr9["date"],"%H"),
summary, na.rm = TRUE,
simplify = FALSE)
And proceed as before.
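The effect of simplify = FALSE can be tried without the original file. A minimal sketch on synthetic data: the demo data frame, its values, and the grouping name hour are made up for illustration and only stand in for the real "ball" data.

```r
# Five days of hourly readings with no missing values (values are random).
set.seed(1)
dates <- seq(as.POSIXct("2009-09-01 00:00:00", tz = "GMT"),
             by = "hour", length.out = 24 * 5)
demo <- data.frame(date = dates,
                   ws = round(runif(length(dates), 0, 10), 1))

# Group by hour of day, "00" through "23".
hour <- list(hour = format(demo$date, "%H"))

# Default simplify = TRUE: every summary has six entries,
# so the ws column comes back as a 24 x 6 matrix.
as_matrix <- aggregate(demo["ws"], hour, summary)

# simplify = FALSE: the ws column is a list with one element per hour,
# whether or not any values are missing.
as_list <- aggregate(demo["ws"], hour, summary, simplify = FALSE)

dim(as_matrix$ws)
length(as_list$ws)
```

With simplify = FALSE the structure no longer depends on where the NA values happen to fall, so every season can go through the same plotting code.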