Select minimum data of grouped data - keeping all columns
Problem description
I am running into a wall here.
I have a data frame with many rows.
Here is a schematic example.
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
I need to group my data frame by ID, then select the row with the earliest date in each group, and write the output into a new data frame, keeping all columns.
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
That is how I approach it:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
To verify: the number of rows (nrow) of my resulting data frame should equal the number of unique IDs.
unique(myDf$ID) %>% length == nrow(test)
FALSE
That does not work. I also tried this:
newDf <- ddply(.data = myDf,
.variables = "ID",
.fun = function(piece){
take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
piece[take.this.row,]
})
That runs forever, so I terminated it.
Why is the first approach not working and what would be a good way to approach the problem?
Since your dataset is fairly large, I think data.table is a better fit. Here is a data.table version that solves your problem; on large data it will likely be quicker than the dplyr approach:
library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
myDate=c("01.01.2015","02.02.2014",
"03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]
> df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
> df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
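One caveat about the line above: `myDate == min(myDate)` keeps every row that shares the earliest date within an ID, so the result can have more rows than there are IDs when dates tie. If exactly one row per ID is wanted, a minimal sketch using `which.min` (which returns only the first minimum) could look like this. The names `dt`, `idx` and `dt_one` are illustrative, not from the original answer.

```r
library(data.table)

# Toy data mirroring the schematic example, with dates already parsed
dt <- data.table(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = as.Date(c("01.01.2015", "02.02.2014", "03.01.2014",
                     "09.09.2009", "10.10.2010", "06.06.2011"),
                   format = "%d.%m.%Y")
)

# .I holds the row numbers of the whole table; within each ID,
# which.min() picks the position of the first earliest date
idx <- dt[, .I[which.min(myDate)], by = ID]$V1

# Subset the full table by those row numbers, so all columns are kept
dt_one <- dt[idx]
dt_one
```

On the example data both variants return the same three rows; they only differ when an ID has several rows with the same earliest date.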
PS: You can use setDT(myDf) to convert a data.frame to a data.table by reference, without copying it.
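As for why the first approach does not work, two things in the posted dplyr code look like the likely cause: `mutate(date == ...)` uses the comparison operator `==` rather than `=`, so no `date` column holding the parsed dates is created, and `filter()` refers to `b2`, which is not a column in the example. A corrected sketch, assuming the column names from the question, might be:

```r
library(dplyr)

# Toy data mirroring the schematic example in the question
myDf <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)

test <- myDf %>%
  # `=` (not `==`) creates the new column with the parsed dates
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%
  group_by(ID) %>%
  # compare against the minimum of the same column, per group
  filter(date == min(date)) %>%
  ungroup()

# The check from the question: one row per ID (holds when no dates tie)
n_distinct(myDf$ID) == nrow(test)
```

Like the data.table line, `filter(date == min(date))` keeps all tied rows. In dplyr 1.0.0 or later, `slice_min(date, n = 1, with_ties = FALSE)` in place of the `filter()` call should return exactly one row per group.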