Select minimum data of grouped data - keeping all columns


Question

I am running into a wall here.

I have a data frame with many rows. Here is a schematic example.

#myDf
ID    c1    c2    myDate
A     1     1     01.01.2015
A     2     2     02.02.2014
A     3     3     03.01.2014
B     4     4     09.09.2009
B     5     5     10.10.2010
C     6     6     06.06.2011
....

I need to group my data frame by ID, select the row with the earliest date in each group, and write the output into a new data frame, keeping all columns.

ID    c1    c2    myDate
A     3     3     03.01.2014
B     4     4     09.09.2009
C     6     6     06.06.2011
....

This is how I approached it:

test <- myDf %>%
    group_by(ID) %>%
    mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
    filter(date == min(b2))

To verify: the number of rows in the resulting data frame should equal the number of unique IDs.

unique(myDf$ID) %>% length == nrow(test)

FALSE

That does not work. I also tried this:

newDf <- ddply(.data = myDf,
              .variables = "ID",
              .fun = function(piece){
                  take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
                  piece[take.this.row,]
                  })

That seemed to run forever, so I terminated it.

Why is the first approach not working and what would be a good way to approach the problem?

Solution

Considering you have a fairly large dataset, I think data.table is the better choice. Here is a data.table version that solves your problem; on large data it is likely to be faster than the dplyr approach:

library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
                 myDate=c("01.01.2015","02.02.2014",
                          "03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]

> df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
> df_new
   ID c1 c2     myDate
1:  A  3  3 2014-01-03
2:  B  4  4 2009-09-09
3:  C  6  6 2011-06-06
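
The one-liner above packs two steps together, so a sketch that separates them may make it easier to follow. It reuses the same toy data; the names `idx` and `df_first` are made up here for illustration and are not part of the original answer.

```r
library(data.table)

# Same toy data as in the answer above.
df <- data.table(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, c2 = 1:6,
                 myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                            "09.09.2009", "10.10.2010", "06.06.2011"))
df[, myDate := as.Date(myDate, format = "%d.%m.%Y")]

# Step 1: for each ID, collect the row numbers (.I) where the date equals
# the group minimum. An unnamed j expression lands in a column called V1.
idx <- df[, .I[myDate == min(myDate)], by = ID]

# Step 2: subset the full table by those row numbers, keeping every column.
df_new <- df[idx$V1]

# Variant: which.min() returns only the first minimum, so an ID with tied
# earliest dates still contributes exactly one row.
df_first <- df[df[, .I[which.min(myDate)], by = ID]$V1]
```

With `myDate == min(myDate)`, an ID whose earliest date occurs more than once keeps all of those rows. If exactly one row per ID is required, the `which.min()` variant is probably the safer choice.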

PS: you can use setDT(myDf) to convert a data.frame to a data.table by reference.
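
As for why the first approach in the question fails, two things in the snippet as posted look like likely culprits: `mutate(date == ...)` uses the comparison operator `==` where an assignment `=` is needed, so no `date` column is created, and `filter(date == min(b2))` refers to `b2`, a column that does not exist in the example data. The following is a minimal dplyr sketch with both points changed, assuming the data look like the schematic example (the toy `myDf` is rebuilt here only for illustration):

```r
library(dplyr)

# Toy data mirroring the schematic example in the question.
myDf <- data.frame(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, c2 = 1:6,
                   myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                              "09.09.2009", "10.10.2010", "06.06.2011"),
                   stringsAsFactors = FALSE)

test <- myDf %>%
  group_by(ID) %>%
  # `=` creates the new column; `==` would only compare.
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%
  # Compare each row against the minimum date of its own group.
  filter(date == min(date)) %>%
  ungroup()

# Sanity check from the question: one row per ID,
# provided no ID has tied earliest dates.
length(unique(myDf$ID)) == nrow(test)
```

Note that `filter(date == min(date))` keeps every row that ties for the earliest date, which is another way the row-count check can come out `FALSE` on real data. In recent dplyr versions (1.0.0 and later), `slice_min(date, n = 1, with_ties = FALSE)` should return a single row per group instead.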
