Creating sets of samples from a given data frame using a condition in R


Problem description

I have an input table with more than 750K rows. It has a field called quarter. I want to create a sample such that I get 10% of the records from each quarter. The main attributes of the data.frame are:

  1. "SERIAL_NBR"
  2. "MODELNO"
  3. "War.Start.Monthly"

"Start.Qua.Yr" is the field where the quarter is recorded. Is there any way to generate sample data that contains 10% of the records for each quarter?

Using the sample function I can get a sample regardless of the quarter. The code for that is:

raw_claim_input[sample(1:nrow(raw_claim_input),as.integer(nrow(raw_claim_input)/10)),]

When I do the following for a single value, I do not get the expected results; there seems to be a logical problem in how the values are selected:

raw_claim_input[sample(1:nrow(raw_claim_input[raw_claim_input$War.Start.Monthly=="08-M2",]),as.integer(nrow(raw_claim_input[raw_claim_input$War.Start.Monthly=="08-M2",])/10)),]

The value 08-M2 is the filter; I want to do this for all available values. War.Start.Monthly has 70 distinct values, and I want to generate a sample for each of them.
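
A likely cause of the unexpected result, sketched on a small made-up data frame (`df`, `month` and its values are hypothetical stand-ins, not the real `raw_claim_input`): `sample(1:nrow(subset), ...)` returns positions within the filtered subset, but those positions are then used to index the full table, so the selected rows need not match the filter. Filtering first and then sampling from the filtered rows avoids the mismatch.

```r
# Toy data: 100 rows, two month codes (hypothetical stand-in for raw_claim_input).
set.seed(1)
df <- data.frame(id = 1:100,
                 month = rep(c("08-M1", "08-M2"), each = 50),
                 stringsAsFactors = FALSE)

# Pattern from the question: the sampled positions lie in 1..50, but they index
# the FULL table, whose first 50 rows are all "08-M1".
n_match <- nrow(df[df$month == "08-M2", ])
wrong <- df[sample(1:n_match, as.integer(n_match / 10)), ]

# Filter first, then sample row positions within the filtered subset.
sub <- df[df$month == "08-M2", ]
right <- sub[sample(nrow(sub), as.integer(nrow(sub) / 10)), ]

unique(wrong$month)  # only "08-M1": none of the rows match the filter
unique(right$month)  # only "08-M2"
```

In the toy data the mismatch is total because the rows are ordered by month; in real data the wrong version would return an arbitrary mix of rows from the top of the table.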

Part of the data:

     Day.Covered           SHIP_DATE Warranty.Start.Qua.Yr War.Start.Monthly AssemblyDateUpdated Warranty.End.Date Warranty.End.Qur.Yr War.End.Monthly
252754         365    06-04-2008 00:00                 08-Q2             08-M6    06-03-2008 00:00        08-04-2064               64-Q2           64-M4
441605        1095 08-17-2010 11:13:07                 10-Q3             10-M8 08-16-2010 12:09:57        08-04-2064               64-Q2           64-M4
583636         731 10-17-2012 00:00:00                 12-Q4            12-M10 10-16-2012 00:00:00        08-04-2064               64-Q2           64-M4
115586         731    01-04-2013 00:00                 13-Q1             13-M1    01-03-2013 00:00        08-04-2064               64-Q2           64-M4
334221        1095 06-13-2011 12:29:23                 11-Q2             11-M6    06-11-2011 11:25        08-04-2064               64-Q2           64-M4
146656        1095 03-16-2011 10:54:37                 11-Q1             11-M3 03-15-2011 08:14:40        08-04-2064               64-Q2           64-M4
249956        1095 06-18-2008 12:35:06                 08-Q2             08-M6    06-06-2008 10:51        08-04-2064               64-Q2           64-M4
276295         731 05-18-2011 00:00:00                 11-Q2             11-M5 05-18-2011 00:00:00        19-11-2014               14-Q4          14-M11
582423         731 10-22-2012 00:00:00                 12-Q4            12-M10 10-22-2012 00:00:00        08-04-2064               64-Q2           64-M4
380369         730    08-04-2009 17:43                 09-Q3             09-M7 07-31-2009 07:14:17        18-01-2012               12-Q1           12-M1

Please let me know if more details are needed.

Solution

This will do:

X <- read.csv(text="Day.Covered,SHIP_DATE,Warranty.Start.Qua.Yr,War.Start.Monthly,AssemblyDateUpdated,Warranty.End.Date,Warranty.End.Qur.Yr,War.End.Monthly
 365,    06-04-2008 00:00, 08-Q2,  08-M6,    06-03-2008 00:00, 08-04-2064 ,64-Q2,  64-M4
1095, 08-17-2010 11:13:07, 10-Q3,  10-M8, 08-16-2010 12:09:57, 08-04-2064 ,64-Q2,  64-M4
 731, 10-17-2012 00:00:00, 12-Q4, 12-M10, 10-16-2012 00:00:00, 08-04-2064 ,64-Q2,  64-M4
 731,    01-04-2013 00:00, 13-Q1,  13-M1,    01-03-2013 00:00, 08-04-2064 ,64-Q2,  64-M4
1095, 06-13-2011 12:29:23, 11-Q2,  11-M6,    06-11-2011 11:25, 08-04-2064 ,64-Q2,  64-M4
1095, 03-16-2011 10:54:37, 11-Q1,  11-M3, 03-15-2011 08:14:40, 08-04-2064 ,64-Q2,  64-M4
1095, 06-18-2008 12:35:06, 08-Q2,  08-M6,    06-06-2008 10:51, 08-04-2064 ,64-Q2,  64-M4
 731, 05-18-2011 00:00:00, 11-Q2,  11-M5, 05-18-2011 00:00:00, 19-11-2014 ,14-Q4, 14-M11
 731, 10-22-2012 00:00:00, 12-Q4, 12-M10, 10-22-2012 00:00:00, 08-04-2064 ,64-Q2,  64-M4
 730,    08-04-2009 17:43, 09-Q3,  09-M7, 07-31-2009 07:14:17, 18-01-2012 ,12-Q1,  12-M1", strip.white = TRUE)

# Replicate X to have enough data for this example.
X <- X[rep(seq(nrow(X)), 100),]

# Partition the data according to quarter.
partitions <- split(X, X$Warranty.Start.Qua.Yr)
# Draw samples from each partition.
samples <- lapply(partitions, function(p) p[sample(nrow(p), nrow(p)/10),])
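
If a single data frame is more convenient than a list, the per-group samples can be bound back together, and the same pattern should carry over to any other grouping column such as `War.Start.Monthly`. The sketch below uses a small synthetic `X` (its values are made up) so that it runs on its own; `sample_fraction` is a hypothetical helper name, not a base R function.

```r
set.seed(42)
# Synthetic stand-in for X: 3 quarters with 200 rows each (values are made up).
X <- data.frame(
  Warranty.Start.Qua.Yr = rep(c("08-Q2", "10-Q3", "12-Q4"), each = 200),
  Day.Covered = sample(c(365, 731, 1095), 600, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sample a fraction of the rows within each level of `by`
# and return the result as one data frame.
sample_fraction <- function(data, by, frac = 0.1) {
  parts <- split(data, data[[by]])
  picked <- lapply(parts, function(p) {
    p[sample(nrow(p), floor(nrow(p) * frac)), , drop = FALSE]
  })
  do.call(rbind, picked)
}

sampled <- sample_fraction(X, "Warranty.Start.Qua.Yr")
table(sampled$Warranty.Start.Qua.Yr)  # 20 rows per quarter in this toy data
```

Because the sample size is rounded down, a group with fewer than 10 rows contributes no rows at `frac = 0.1`; replacing `floor` with `ceiling` would keep at least one row per group if that matters.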
