Aggregate adjacent rows, ignoring certain columns
Problem description
I have a data frame df like the one below:
    > head(df)
      OrderId           Timestamp ErrorCode
    1 3000000 1455594300434609920        NA
    2 3000001 1455594300434614272        NA
    3 3000000 1455594300440175104         0
    4 3000001 1455594300440179712         0
    5 3000002 1455594303468741120        NA
    6 3000002 1455594303469326848         0
I need to collapse the rows so that the output looks something like this:
    > head(df)
    OrderId          Timestamp1          Timestamp2 ErrorCode Diff
    3000000 1455594300434609920 1455594300440175104         0
    3000001 1455594300434614272 1455594300440179712         0
    3000002 1455594303468741120 1455594303469326848         0
I used
    df2 = aggregate(Timestamp ~ ., df, FUN = toString)
but the output is
       OrderId ErrorCode           Timestamp
    10 3000001         0 1455594300440179712
    11 3000002         0 1455594303469326848
    12 3000003         0 1455594303713897984
When I dropped the ErrorCode column and used the same command, I got the expected output:
    > head(kf)
      OrderId           Timestamp
    1 3000000 1455594300434609920
    2 3000001 1455594300434614272
    3 3000000 1455594300440175104
    4 3000001 1455594300440179712
    5 3000002 1455594303468741120
    6 3000002 1455594303469326848
    > kf2 = aggregate(Timestamp ~ ., kf, FUN = toString)
    > head(kf2)
       OrderId                                Timestamp
    10 3000001 1455594300434614272, 1455594300440179712
    11 3000002 1455594303468741120, 1455594303469326848
    12 3000003 1455594303711330816, 1455594303713897984
How do I aggregate in this way without removing the ErrorCode column? There must be some little thing I am missing.
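A note on the "little thing": one likely explanation, offered as a sketch rather than a confirmed diagnosis, is that in Timestamp ~ . the dot expands to every other column, so ErrorCode becomes a grouping variable, and the formula method of aggregate() drops rows containing NA by default (na.action = na.omit). The example below rebuilds the six sample rows (storing Timestamp as character is an assumption made for this sketch), reproduces the lossy result, and then groups by OrderId only. The names bad, good, codes and out are illustrative.

```r
# Sample rows from the question; Timestamp as character is an assumption.
df <- data.frame(
  OrderId   = c(3000000, 3000001, 3000000, 3000001, 3000002, 3000002),
  Timestamp = c("1455594300434609920", "1455594300434614272",
                "1455594300440175104", "1455594300440179712",
                "1455594303468741120", "1455594303469326848"),
  ErrorCode = c(NA, NA, 0, 0, NA, 0),
  stringsAsFactors = FALSE
)

# Original call: '.' means OrderId + ErrorCode, and rows with NA in any
# of the formula's variables are removed before grouping.
bad <- aggregate(Timestamp ~ ., data = df, FUN = toString)

# Grouping by OrderId only keeps both timestamps of each order, because
# ErrorCode is no longer part of the formula.
good <- aggregate(Timestamp ~ OrderId, data = df, FUN = toString)

# Attach the non-missing ErrorCode of each order afterwards.
codes <- unique(df[!is.na(df$ErrorCode), c("OrderId", "ErrorCode")])
out <- merge(good, codes, by = "OrderId")
print(out)
```

Keeping ErrorCode out of the formula avoids both the extra grouping level and the NA filtering; passing na.action = na.pass to the original call would keep the NA rows, but ErrorCode would still be used as a grouping variable.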
Solution
I take it you're actually just looking to reshape your data into a wide format, with separate columns for timestamps 1 and 2. One way is to first add a new column that defines the time point of each measurement, and then melt and cast the data using reshape2.
    # Add an index to the data.frame: 1 for the first row of each order,
    # 2 for the second, and so on
    df$time_ind <- NA_integer_  # create the column before filling it per order
    for (i in unique(df$OrderId)) {
      ii <- df$OrderId == i
      df$time_ind[ii] <- seq_along(ii[ii])
    }
    library(reshape2)
    # Long format: one row per (OrderId, time_ind, variable)
    df_long <- melt(df, id.vars = c("OrderId", "time_ind"),
                    measure.vars = c("Timestamp", "ErrorCode"))
    # Wide format: one column per variable and time point
    dcast(df_long, OrderId ~ variable + time_ind)
which will give you
      OrderId         Timestamp_1         Timestamp_2 ErrorCode_1 ErrorCode_2
    1 3000000 1455594300434609920 1455594300440175104        <NA>           0
    2 3000001 1455594300434614272 1455594300440179712        <NA>           0
    3 3000002 1455594303468741120 1455594303469326848        <NA>           0
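The same wide layout can also be sketched in base R, without reshape2. This is an alternative to the answer's approach, not a restatement of it: ave() builds the per-order counter without an explicit loop, and reshape() spreads the rows into columns. The sample data is rebuilt from the question, with Timestamp stored as character as an assumption; wide is an illustrative name.

```r
# Sample rows from the question; Timestamp as character is an assumption.
df <- data.frame(
  OrderId   = c(3000000, 3000001, 3000000, 3000001, 3000002, 3000002),
  Timestamp = c("1455594300434609920", "1455594300434614272",
                "1455594300440175104", "1455594300440179712",
                "1455594303468741120", "1455594303469326848"),
  ErrorCode = c(NA, NA, 0, 0, NA, 0),
  stringsAsFactors = FALSE
)

# Running counter within each OrderId (1 for the first row, 2 for the next).
df$time_ind <- ave(seq_along(df$OrderId), df$OrderId, FUN = seq_along)

# One row per OrderId; the remaining columns are suffixed with the counter.
wide <- reshape(df, idvar = "OrderId", timevar = "time_ind",
                direction = "wide", sep = "_")
print(wide)
```

Because Timestamp and ErrorCode are never stacked into a single value column here, ErrorCode stays numeric, whereas the `<NA>` entries in the output above suggest the melted values were converted to a common character or factor type.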