R aggregate gives differently structured results using subsets from the same data


Question

I'm making diurnal cycles of wind speed from a data frame (ball) containing several years of hourly data. I want to plot them by season, so I subset the dates I need and join them like this:

b8 = subset(ball, as.Date(date)>="2008-09-01 00:00:00, GMT" & as.Date(date)<= "2008-11-30 23:00:00, GMT"  )
b9  = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT"  )
b10 = subset(ball,  as.Date(date)>="2010-09-01 00:00:00, GMT" & as.Date(date)<= "2010-11-30 23:00:00, GMT")
ballspr = rbind(b8,b9,b10)

I then get a diurnal cycle using this:

sprwsdiurnal <- aggregate(ballspr["ws"], format(ballspr["date"],"%H"),summary, na.rm=T)

For three out of four seasons, this makes an object with this structure:

   date                                               ws
1    00  0.200, 1.000, 1.600, 2.021, 2.500, 8.000, 5.000
2    01  0.100, 1.000, 1.600, 1.988, 2.500, 8.600, 1.000
3    02  0.100, 1.000, 1.700, 1.982, 2.600, 8.900, 1.000

...through to 24 hours...

23   22  0.100, 1.200, 1.800, 2.222, 2.950, 9.100, 1.000
24   23  0.100, 1.000, 1.600, 2.072, 2.700, 8.800, 1.000

This is what I want, because boxplot works with this structure:

par(  mar = c(5, 5, 2, 2))
boxplot(sprwsdiurnal$ws, col="dodger blue",pch=16,font.lab=2,cex.lab=1.5,cex.axis=2,xlab="Hour",range=0, ylab=quote(Windspeed ~ "(" * m ~ s ^-1 * ")"),xaxt="n",main="Spring")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5, font.lab=2)

The trouble is that one season comes out like this:

      date ws.Min. ws.1st Qu. ws.Median ws.Mean ws.3rd Qu. ws.Max. ws.NA's
1    00   0.000      1.300     2.100   2.539      3.200  10.500   2.000
2    01   0.100      1.275     2.100   2.499      3.200   9.800   2.000
3    02   0.200      1.200     2.000   2.514      3.400   9.000   2.000

...through to 24 hours...

23   22   0.100      1.200     1.950   2.582      3.325  11.900   2.000
24   23   0.100      1.300     2.000   2.585      3.400  11.200   2.000

boxplot does not work with this format. I can't explain why this happens, since the code for each season is the same and every subset comes from the same data frame. Why does one come out differently? Any ideas appreciated.

Edit: here's the data. I've checked these two seasons, and they still give the two different formats shown above.

https://www.dropbox.com/s/v5kss0bgjyhrtw1/ball.csv

ball=read.csv("ball.csv", header=T)
ball$date = as.POSIXct(strptime(ball$date, format = "%Y-%m-%d %H:%M:%S", "GMT"))

win9  = subset(ball, as.Date(date)>="2009-06-01 00:00:00, GMT" & as.Date(date)<= "2009-08-31 23:00:00, GMT"  )
aut9  = subset(ball, as.Date(date)>="2009-03-01 00:00:00, GMT" & as.Date(date)<= "2009-05-31 23:00:00, GMT"  )
spr9  = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT"  )
sum9  = subset(ball, as.Date(date)>="2008-12-01 00:00:00, GMT" & as.Date(date)<= "2009-02-28 23:00:00, GMT"  )


sprdiurnal <- aggregate(spr9["ws"], format(spr9["date"],"%H"),summary, na.rm=T)
par(  mar = c(5, 5, 4, 2))
boxplot(sprdiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5) 

windiurnal <- aggregate(win9["ws"], format(win9["date"],"%H"),summary, na.rm=T)
par(  mar = c(5, 5, 4, 2))
boxplot(windiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5)


Recommended answer

The "problem", as far as I can tell, is that summary returns results of the same length for every hour in "sprdiurnal", so aggregate can simplify them into a rectangular matrix. For your other subsets, some hours include NA values and others don't, so the summaries differ in length (an extra NA's entry appears only where values are missing). That result is not rectangular, so R stores the summaries as a list.

I'll demonstrate with the "iris" dataset, but first I'll create an "iris_2" dataset that has one NA value.

iris_2 <- iris
iris_2$Sepal.Length[10] <- NA

Let's compare the aggregation output, which in these cases is just the second column. You'll see that the "iris" dataset, which has no missing values, returns a rectangular matrix as the second "column" of the data.frame. The "iris_2" dataset, because of its one NA value, gets stored as a list instead, which is the form you want for your particular purpose.

(irisagg <- aggregate(iris["Sepal.Length"], iris["Species"], summary))[[2]]
#      Min. 1st Qu. Median  Mean 3rd Qu. Max.
# [1,]  4.3   4.800    5.0 5.006     5.2  5.8
# [2,]  4.9   5.600    5.9 5.936     6.3  7.0
# [3,]  4.9   6.225    6.5 6.588     6.9  7.9
(iris_2agg <- aggregate(iris_2["Sepal.Length"], iris_2["Species"], summary))[[2]]
# $`0`
#     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    4.300   4.800   5.000   5.008   5.200   5.800       1 
# 
# $`1`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   5.600   5.900   5.936   6.300   7.000 
# 
# $`2`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   6.225   6.500   6.588   6.900   7.900 
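If you'd rather check the structure programmatically than read the printed output, something like the following sketch should do. It rebuilds the two objects above and only uses base R functions; the expectation that the group containing the NA has one extra entry follows from summary adding an NA's element.

```r
# Rebuild the two aggregates from the demonstration above
iris_2 <- iris
iris_2$Sepal.Length[10] <- NA

irisagg   <- aggregate(iris["Sepal.Length"], iris["Species"], summary)
iris_2agg <- aggregate(iris_2["Sepal.Length"], iris_2["Species"], summary)

# Equal-length summaries are simplified into a matrix
is.matrix(irisagg[[2]])

# Unequal-length summaries are kept as a list
is.list(iris_2agg[[2]])

# Length of each group's summary; the group with the NA has one more entry
lengths(iris_2agg[[2]])
```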

Here's how we would convert the matrix into a list.

irisagg$Summary <- unlist(apply(irisagg[[2]], 1, list), recursive = FALSE)
irisagg$Summary
# [[1]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.300   4.800   5.000   5.006   5.200   5.800 
# 
# [[2]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   5.600   5.900   5.936   6.300   7.000 
# 
# [[3]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   6.225   6.500   6.588   6.900   7.900 
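If the unlist(apply(...)) idiom feels opaque, an explicit loop over the matrix rows is one possible alternative. This is only a sketch of an equivalent conversion, and the names m and rows_as_list are made up for illustration.

```r
irisagg <- aggregate(iris["Sepal.Length"], iris["Species"], summary)
m <- irisagg[[2]]  # the matrix "column" produced by aggregate

# One list element per matrix row; each element keeps the column names
rows_as_list <- lapply(seq_len(nrow(m)), function(i) m[i, ])

# Store the list as a new list-column, as in the apply() version
irisagg$Summary <- rows_as_list
irisagg$Summary
```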

Of course, a much more direct approach is to use the simplify argument of aggregate:

(iris_3agg <- aggregate(iris["Sepal.Length"], 
                        iris["Species"], summary, 
                        simplify = FALSE))[[2]]
# $`0`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.300   4.800   5.000   5.006   5.200   5.800 
# 
# $`1`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   5.600   5.900   5.936   6.300   7.000 
# 
# $`2`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   6.225   6.500   6.588   6.900   7.900 

Applying this to your example, "sprdiurnal" is the subset that's giving you trouble. View sprdiurnal$ws by itself and verify that it's a matrix. Let's convert it to a list.

sprdiurnal$ws2 <- unlist(apply(sprdiurnal$ws, 1, list), recursive=FALSE)

Now you can proceed with boxplot as you did with the other seasons.

boxplot(sprdiurnal$ws2)  # add the same graphical arguments you used for the other seasons

Or, remake your sprdiurnal object using:

sprdiurnal <- aggregate(spr9["ws"], 
                        format(spr9["date"],"%H"), 
                        summary, na.rm = TRUE, 
                        simplify = FALSE)

and proceed as before.
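Since the same call is repeated for every season, it may be worth wrapping it in a small function so that all seasons go through the simplify = FALSE path. The sketch below is not from your code: diurnal_summary is a hypothetical helper, and the demo data frame is synthetic hourly data standing in for one season of ball (it only assumes a POSIXct date column and a numeric ws column).

```r
# Hypothetical helper: summarise wind speed by hour of day, always as a list
diurnal_summary <- function(df) {
  aggregate(df["ws"],
            list(hour = format(df$date, "%H")),
            summary,
            simplify = FALSE)
}

# Synthetic stand-in for one season of hourly data (10 days, 240 rows)
set.seed(1)
dates <- seq(as.POSIXct("2009-09-01 00:00:00", tz = "GMT"),
             as.POSIXct("2009-09-10 23:00:00", tz = "GMT"),
             by = "hour")
demo <- data.frame(date = dates, ws = runif(length(dates), 0, 10))

complete <- diurnal_summary(demo)  # no missing values anywhere
demo$ws[5] <- NA
with_gap <- diurnal_summary(demo)  # one hour now contains an NA

# Both results hold ws as a list, whether or not NAs are present
is.list(complete$ws)
is.list(with_gap$ws)
```

With a helper like this, the structure no longer depends on which hours happen to contain missing values, so the same boxplot call can be reused for each season.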
