Efficiently selecting top number of rows for each unique value of a column in a data.frame

Problem description


I am trying to take a subset of a data frame based on how often a value occurs. This is best explained with the example below. This question is closely related to "Selecting top finite number of rows for each unique value of a column in a data fame in R". However, I want to vary the number of rows selected by the head() command for each product.

#Sample data
input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
colnames(input) <- c( "Product" , "Something" ,"Date")
input <- as.data.frame(input)
input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")

#Sort based on date, I want to leave out the entries with the oldest dates.
input <- input[ with( input, order(Date)), ]

#Create number of items I want to select
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling( table_input$Freq*0.25  )

# The next part is very slow (I have 2 million rows and 90k different products)

first <- TRUE

for (i in table_input$Var1) {
  # Rows for this product, and how many of them to keep
  data_selected <- input[input$Product == i, ]
  number <- table_input[table_input$Var1 == i, ]$twentyfive

  # Named top_rows so that it does not mask the head() function
  top_rows <- head(data_selected, number)

  if (!first) {
    output <- rbind(output, top_rows)
  } else {
    output <- top_rows
  }
  first <- FALSE
}

I am hoping that someone knows a better, more efficient way. I tried to use the split function from the answer to "Selecting top finite number of rows for each unique value of a column in a data fame in R" to split on the products, then iterate over the pieces and call head() on each. However, on my full data the split function always runs out of memory (cannot allocate ..)

input_split <- split(input, input$Product) # Works here, but not on my real data.

So in the end my problem is that I want to select a different number of rows for each unique Product. Here that means 2 items from 1000001 and 1 item each from 1000002 and 1000003.

Solution

Two solutions spring to mind. plyr::ddply is designed for your needs but using a data.table will be waaaaaay faster.

You want to take a data.frame, split it up into chunks by Product, remove the earliest 25% of rows (rounded up) from each chunk, which is sorted by date, and recombine the chunks into a data.frame. This can be accomplished in one simple line...

require( plyr )
ddply( input , .(Product) , function(x) x[ - c( 1 : ceiling( nrow(x) * 0.25 ) ) , ] )
#  Product Something       Date
#1 1000001    100005 2011-01-01
#2 1000001    100002 2011-01-02
#3 1000001    100006 2011-01-02
#4 1000001    100004 2011-01-04
#5 1000002    100007 2011-01-01
#6 1000002    100003 2011-01-04
#7 1000003    100002 2011-01-02
#8 1000003    100008 2011-01-04
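
The same split, drop, recombine idea can also be sketched in base R without plyr. This is only an illustrative alternative and not part of the original answer: it rebuilds the sample data directly as a data.frame, and the names pos, n and kept are made up for the example.

```r
# Sample data from the question, built directly as a data.frame
input <- data.frame(
  Product   = rep(c(1000001, 1000002, 1000003), times = c(6, 3, 3)),
  Something = c(100001, 100002, 100003, 100004, 100005, 100006,
                100002, 100003, 100007, 100002, 100003, 100008),
  Date      = as.Date(c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04"))
)

# Sort by Product, then Date, so the oldest rows come first within each product
input <- input[order(input$Product, input$Date), ]

# pos: position of each row within its product; n: number of rows in that product
pos <- ave(seq_len(nrow(input)), input$Product, FUN = seq_along)
n   <- ave(seq_len(nrow(input)), input$Product, FUN = length)

# Keep everything after the first ceiling(25%) rows of each product
kept <- input[pos > ceiling(n * 0.25), ]
nrow(kept)
```

Because ave() works on whole vectors, this avoids both the explicit loop and the split() call that ran out of memory in the question. How it compares in speed with data.table on 2 million rows has not been measured here.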

data.table solution

For data.table you will need the latest development version from r-forge (because negative subscripts are not yet implemented in the CRAN version of data.table). Make sure you use the install.packages call below to get the latest version...

install.packages( "data.table" , repos="http://r-forge.r-project.org" )
require( data.table )
DT <- data.table( input )

#  Sort by Product then Date very quickly
setkeyv( DT , c( "Product" , "Date" ) )

#  Return the bottom 75% of rows (i.e. not the earliest)
DT[ ,  tail( .SD , -ceiling( nrow(.SD) * .25 ) )  , by = Product ] 
#   Product Something       Date
#1: 1000001    100005 2011-01-01
#2: 1000001    100002 2011-01-02
#3: 1000001    100006 2011-01-02
#4: 1000001    100004 2011-01-04
#5: 1000002    100007 2011-01-01
#6: 1000002    100003 2011-01-04
#7: 1000003    100002 2011-01-02
#8: 1000003    100008 2011-01-04

A better way to use data.table

You can do this more simply, without needing the development version of data.table...

DT[ ,  .SD[ -c( 1:ceiling( .25 * .N ) ) ] , by = Product ] 

You can also use lapply in the j argument (I was worried about my use of .SD). This runs in ~14 seconds on a data.table of 2e6 rows with 90,000 products (groups); note that the benchmark code below actually draws product IDs from 1:9e5...

set.seed(1)
Product <- sample( 1:9e5 , 2e6 , replace = TRUE )
dates <- sample( 1:20 , 2e6 , replace = TRUE )
Date <- as.Date( Sys.Date() + dates )
DT <- data.table( Product = Product , Date = Date )

system.time( { setkeyv( DT , c( "Product" , "Date" ) ); DT[ , lapply( .SD , `[` ,  -c( 1:ceiling( .25 * .N ) ) ) , by = Product ] } )
#   user  system elapsed 
#  14.65    0.03   14.74 

Update: The best way to use data.table!

Thanks to @Arun (who is now an author of the data.table package), we now have the best way to use data.table. It uses .I, which is an integer vector of all the row indices. Within each group, drop the first 25% of records from .I with -(1:ceiling(.N*.25)), then subset DT by the remaining row indices to get the final table. This is ~4-5 times faster than my .SD method above. Amazing stuff!

system.time( DT[ DT[, .I[-(1:ceiling(.N*.25))] , by = Product]$V1] )
#   user  system elapsed 
#   3.02    0.00    3.03
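
To make the two steps of the .I approach easier to follow, here is a minimal sketch on the small sample data from the question. It assumes the data.table package is installed; the intermediate names idx and result are introduced only for illustration.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  Product   = rep(c(1000001, 1000002, 1000003), times = c(6, 3, 3)),
  Something = c(100001, 100002, 100003, 100004, 100005, 100006,
                100002, 100003, 100007, 100002, 100003, 100008),
  Date      = as.Date(c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04"))
)

# Sort by Product, then Date
setkeyv(DT, c("Product", "Date"))

# Step 1: per product, collect the row numbers that remain after dropping
# the first ceiling(25%) rows; .I holds row numbers within the whole table
idx <- DT[, .I[-(1:ceiling(.N * 0.25))], by = Product]$V1
idx
# 3 4 5 6 8 9 11 12

# Step 2: a single subset of the full table by those row numbers
result <- DT[idx]
```

The inner call only handles integer row numbers per group, and the full rows are materialised once in the final subset, which is a plausible reason for the speed-up over building .SD for every group.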
