Select observations from a subset to create a new subset based on a large dataframe in R

Problem description

I have a dataset (Purchase.df) that contains many columns and rows. The important variable names for this question are "Customer", "OrderDate", "DateRank" (which ranks the dates so I can find the smallest date) and "BrandName." Below is a very small sample of what I'm working with: (I'm new to this website, so I hope what I paste below works)

Purchase.df<-structure(list(Customer = c(10071535L, 10071535L, 10071535L, 
10071535L, 10071535L, 10071535L, 10071711L, 10071711L, 10071711L, 
10071711L, 10071711L, 10071711L, 10071711L, 10071711L, 10071711L, 
10071711L, 10071711L, 10071711L, 10072059L, 10072059L, 10072059L, 
10072113L, 10072113L, 10072113L, 10072113L, 10072113L, 10072113L, 
10072113L), BrandName = structure(c(1L, 2L, 2L, 2L, 3L, 3L, 2L, 
2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 3L, 3L, 3L, 3L), .Label = c("X", "Y", "Z"), class = "factor"), 
OrderDate = structure(c(14L, 14L, 15L, 16L, 19L, 20L, 11L, 
18L, 5L, 6L, 1L, 17L, 21L, 22L, 23L, 8L, 10L, 13L, 7L, 9L, 
12L, 4L, 4L, 2L, 2L, 2L, 3L, 3L), .Label = c("1/17/2011 0:00", 
"1/19/2010 0:00", "1/25/2010 0:00", "1/4/2010 0:00", "10/22/2010 0:00", 
"11/15/2010 0:00", "11/23/2011 0:00", "12/14/2011 0:00", 
"12/16/2011 0:00", "2/7/2012 0:00", "3/16/2010 0:00", "3/21/2012 0:00", 
"4/16/2012 0:00", "4/27/2012 0:00", "5/16/2012 0:00", "5/30/2012 0:00", 
"5/5/2011 0:00", "6/1/2010 0:00", "6/12/2012 0:00", "7/3/2012 0:00", 
"8/1/2011 0:00", "8/16/2011 0:00", "9/19/2011 0:00"), class = "factor"), 
DateRank = c(18.5, 18.5, 20, 21, 24, 25, 15, 23, 9, 10, 1, 
22, 26, 27, 28, 12, 14, 17, 11, 13, 16, 7.5, 7.5, 3, 3, 3, 
5.5, 5.5)), .Names = c("Customer", "BrandName", "OrderDate", 
"DateRank"), row.names = c(NA, -28L), class = "data.frame")

I've created a subset of this large dataset (subset.df) which finds the first OrderDate for each customer, and tells me which brand they purchased. I used the following code to do this:

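# Split the purchases into one data frame per customer
subset1 <- split(Purchase.df, Purchase.df$Customer)
# Within each customer, keep the row(s) with the smallest DateRank
subset2 <- lapply(subset1, function(chunk) chunk[which(chunk$DateRank == min(chunk$DateRank)), ])
# Stack the per-customer results back into a single data frame
subset.df <- do.call(rbind, subset2)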

Now, I want to figure out which customers ordered Brand X on their first OrderDate, and create a new dataset (BigSubset.df) that contains all of the OrderDates for the customers that purchased Brand X on their first order date.

Should look something like this:

Customer    BrandName   OrderDate   DateRank
10071535    X   4/27/2012 0:00  18.5
10071535    Y   4/27/2012 0:00  18.5
10071535    Y   5/16/2012 0:00  20
10071535    Y   5/30/2012 0:00  21
10071535    Z   6/12/2012 0:00  24
10071535    Z   7/3/2012 0:00   25
10072059    X   11/23/2011 0:00 11
10072059    X   12/16/2011 0:00 13
10072059    X   3/21/2012 0:00  16
10072113    X   1/4/2010 0:00   7.5
10072113    Y   1/4/2010 0:00   7.5
10072113    Y   1/19/2010 0:00  3
10072113    Z   1/19/2010 0:00  3
10072113    Z   1/19/2010 0:00  3
10072113    Z   1/25/2010 0:00  5.5
10072113    Z   1/25/2010 0:00  5.5

I can't seem to get R to reference the smaller dataset when I attempt to create BigSubset.df from Purchase.df, because the two data frames have different numbers of rows. I've searched on Google and haven't seen any answers, so I'm not even sure if this is possible in R. Let me know what you think.

Solution

Maybe I'm misunderstanding, but I believe this works:

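# Customers whose first-order rows in subset.df include Brand X
Xfirst <- subset.df$Customer[subset.df$BrandName == "X"]
# Keep every order in the full data for those customers
BigSubset.df <- Purchase.df[Purchase.df$Customer %in% Xfirst, ]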
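
To see the idea in isolation, here is a minimal, self-contained sketch on toy data. The values and the names `toy.df`, `first.df`, `x_first`, and `big.df` are made up for illustration and are not from the question. It also uses `ave()` instead of `split()`/`lapply()` to find each customer's lowest DateRank, which is just one possible alternative.

```r
# Toy data (hypothetical values) with the same kind of columns as Purchase.df.
toy.df <- data.frame(
  Customer  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  BrandName = factor(c("X", "Y", "Z", "Y", "X", "X", "X")),
  DateRank  = c(1, 1, 2, 1, 2, 1, 2)
)

# Lowest DateRank per customer, repeated on every row of that customer.
first_rank <- ave(toy.df$DateRank, toy.df$Customer, FUN = min)

# Rows that fall on each customer's first order date (ties are kept).
first.df <- toy.df[toy.df$DateRank == first_rank, ]

# Customers with at least one Brand X row on their first order date.
x_first <- unique(first.df$Customer[first.df$BrandName == "X"])

# All orders for those customers, taken from the full data.
big.df <- toy.df[toy.df$Customer %in% x_first, ]
print(big.df)
```

Because `%in%` tests each row's Customer against a set of IDs, the two data frames do not need to have the same number of rows.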

I think you may have a bug in your DateRank: in your example, Customer 10072113 has 1/19/2010 ranked 3 but the earlier 1/4/2010 ranked 7.5. (Side note: in your chunk function you can use which.min(chunk$DateRank) instead of which(chunk$DateRank==min(chunk$DateRank)), which I believe is more efficient. Be aware, though, that which.min returns only the first minimum, so it keeps a single row even when several orders tie for the lowest rank.)
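
In the posted sample, DateRank matches `rank(as.integer(Purchase.df$OrderDate))` exactly, which suggests the rank was computed on the factor rather than on real dates. That is an inference from the sample, not something the question states. The sketch below uses toy data (hypothetical values; `toy.df`, `BadRank`, `ParsedDate`, and `cust1` are illustrative names) to show how ranking a factor differs from ranking parsed dates, and how `which.min()` differs from `which(... == min(...))` when ranks tie.

```r
# Toy data (hypothetical values); dates are stored as a factor of text labels.
toy.df <- data.frame(
  Customer  = c(1L, 1L, 1L, 2L, 2L),
  BrandName = factor(c("X", "Y", "Z", "Y", "X")),
  OrderDate = factor(c("1/4/2010 0:00", "1/4/2010 0:00", "1/19/2010 0:00",
                       "12/16/2011 0:00", "3/21/2012 0:00"))
)

# Ranking the factor ranks its integer codes, which follow the sorted
# label strings, so "1/19/2010" sorts ahead of "1/4/2010".
toy.df$BadRank <- rank(as.integer(toy.df$OrderDate))

# Parse the labels into Date objects first; trailing " 0:00" is ignored.
toy.df$ParsedDate <- as.Date(as.character(toy.df$OrderDate), format = "%m/%d/%Y")

# Rank the real dates, so earlier dates get lower ranks.
toy.df$DateRank <- rank(as.numeric(toy.df$ParsedDate))
print(toy.df)

# Tie behaviour for customer 1, whose two earliest orders share a date.
cust1 <- toy.df[toy.df$Customer == 1L, ]
which.min(cust1$DateRank)                     # index of the first minimum only
which(cust1$DateRank == min(cust1$DateRank))  # indices of all tied minima
```

If several brands can be bought on the same first order date, the `which(... == min(...))` form keeps all of them, so a Brand X purchase is not lost just because it is not the first tied row.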
