从子集中选择观察值,以根据R中的大数据帧创建新的子集 [英] Select observations from a subset to create a new subset based on a large dataframe in R
问题描述
Purchase.df< -structure(list(Customer = c(10071535L,10071535L,10071535L,
10071535L,10071535L,10071535L,10071711L,10071711L,10071711L,
10071711L,10071711L,10071711L ,10071711L,10071711L,10071711L,
10071711L,10071711L,10071711L,10072059L,10072059L,10072059L,
10072113L,10072113L,10072113L,10072113L,10072113L,10072113L,
10072113L),BrandName = (1L,2L,2L,2L,3L,3L,2L,
2L,2L,2L,3L,3L,1L,3L,1L,2L,1L,1L,1L,1L,1L, 2L,
2L,3L,3L,3L,3L),.Label = c(X,Y,Z),class =factor),
OrderDate = c(14L,14L,15L,16L,19L,20L,11L,
18L,5L,6L,1L,17L,21L,22L,23L,8L,10L,13L,7L,9L,
12L,4L,4L,2L,2L,2L,3L,3L),标号= c(1/17/2011 0:00,
1/19/2010 0:00,1 / 25/2010 0:00,1/4/2010 0:00,10/22/2010 0:00 ,
11/15/2010 0:00,11/23/2011 0:00,12/14/2011 0:00,
12/16/2011 0 :00,2/7/2012 0:00,3/16/2010 0:00,3/21/2012 0:00,
4/16/2012 0:00 ,4/27/2012 0:00,5/16/2012 0:00,5/30/2012 0:00,
5/5/2011 0:00, 6/1/2010 0:00,6/12/2012 0:00,7/3/2012 0:00,
8/1/2011 0:00,8 / 16/2011 0:00,9/19/2011 0:00),class =factor),
DateRank = c(18.5,18.5,20,21,24,25,15, 23,9,10,1,
22,26,27,28,12,14,17,11,13,16,7.5,7.5,3,3,3,
5.5,5.5) ),.Names = c(Customer,BrandName,OrderDate,
DateRank),row.names = c(NA,-28L),class =data.frame $ b
我创建了一个大型数据集(subset.df)的子集,它为每个客户找到了第一个OrderDate并告诉我他们购买了哪个品牌。我使用以下代码来执行此操作:
subset1< -split(Purchase.df,Purchase.df $ Customer)
subset2< -lapply(split(Purchase.df,Purchase.df $ Customer),function(chunk)chunk [which(chunk $ DateRank == min(chunk $ DateRank)),])
subset.df< ; -do.call(rbind,as.list(subset2))
现在,我要图哪些客户在他们的第一个订单日期订购了品牌X,并创建了一个新的数据集(BigSubset.df),其中包含在首个订单日期购买品牌X的客户的所有订单日期。
应该看起来像这样:
客户品牌名称OrderDate DateRank
10071535 X 4/27/2012 0:00 18.5
10071535 Y 4/27/2012 0:00 18.5
10071535 Y 5/16/2012 0:00 20
10071535 Y 5/30/2012 0:00 21
10071535 Z 6/12/2012 0:00 24
10071535 Z 7/3/2012 0:00 25
10072059 X 11/23/2011 0:00 11
10072059 X 12/16/20 11 0:00 13
10072059 X 3/21/2012 0:00 16
10072113 X 1/4/2010 0:00 7.5
10072113 Y 1/4/2010 0:00 7.5
10072113 Y 1/19/2010 0:00 3
10072113 Z 1/19/2010 0:00 3
10072113 Z 1/19/2010 0:00 3
10072113 Z 1/25/2010 0:00 5.5
10072113 Z 1/25/2010 0:00 5.5
当我尝试从Purchase.df创建BigSubset.df时,似乎不能让R引用较小的数据集,因为行数不相等。我在Google上搜索过,没有看到任何答案,所以我甚至不确定是否可以在R.让我知道你的想法。
也许我有误会,但我相信这有用:
Xfirst < as.vector(subset.df [subset.df $ BrandName ==X,])$ Customer
BigSubset.df< - Purchase.df [Purchase.df $ Customer%in%Xfirst,]
我认为您可能在您的daterank中有一个错误,因为在您的示例中,客户10072113的日期是1/19/2010排名第3,但较早的1/4/2010排名7.5。 (Sidenote,在你的 chunk
函数中,你可以使用 which.min(chunk $ DateRank)
而不是其中(chunk $ DateRank == min(chunk $ DateRank))
,我认为更有效率。)
I have a dataset (Purchase.df) that contains many columns and rows. The important variable names for this question are "Customer", "OrderDate", "DateRank" (which ranks the dates so I can find the smallest date) and "BrandName." Below is a very small sample of what I'm working with: (I'm new to this website, so I hope what I paste below works)
Purchase.df<-structure(list(Customer = c(10071535L, 10071535L, 10071535L,
10071535L, 10071535L, 10071535L, 10071711L, 10071711L, 10071711L,
10071711L, 10071711L, 10071711L, 10071711L, 10071711L, 10071711L,
10071711L, 10071711L, 10071711L, 10072059L, 10072059L, 10072059L,
10072113L, 10072113L, 10072113L, 10072113L, 10072113L, 10072113L,
10072113L), BrandName = structure(c(1L, 2L, 2L, 2L, 3L, 3L, 2L,
2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("X", "Y", "Z"), class = "factor"),
OrderDate = structure(c(14L, 14L, 15L, 16L, 19L, 20L, 11L,
18L, 5L, 6L, 1L, 17L, 21L, 22L, 23L, 8L, 10L, 13L, 7L, 9L,
12L, 4L, 4L, 2L, 2L, 2L, 3L, 3L), .Label = c("1/17/2011 0:00",
"1/19/2010 0:00", "1/25/2010 0:00", "1/4/2010 0:00", "10/22/2010 0:00",
"11/15/2010 0:00", "11/23/2011 0:00", "12/14/2011 0:00",
"12/16/2011 0:00", "2/7/2012 0:00", "3/16/2010 0:00", "3/21/2012 0:00",
"4/16/2012 0:00", "4/27/2012 0:00", "5/16/2012 0:00", "5/30/2012 0:00",
"5/5/2011 0:00", "6/1/2010 0:00", "6/12/2012 0:00", "7/3/2012 0:00",
"8/1/2011 0:00", "8/16/2011 0:00", "9/19/2011 0:00"), class = "factor"),
DateRank = c(18.5, 18.5, 20, 21, 24, 25, 15, 23, 9, 10, 1,
22, 26, 27, 28, 12, 14, 17, 11, 13, 16, 7.5, 7.5, 3, 3, 3,
5.5, 5.5)), .Names = c("Customer", "BrandName", "OrderDate",
"DateRank"), row.names = c(NA, -28L), class = "data.frame")
I've created a subset of this large dataset (subset.df) which finds the first OrderDate for each customer, and tells me which brand they purchased. I used the following code to do this:
subset1<-split(Purchase.df,Purchase.df$Customer)
subset2<-lapply(split(Purchase.df,Purchase.df$Customer), function(chunk) chunk[which(chunk$DateRank==min(chunk$DateRank)),])
subset.df<-do.call(rbind, as.list(subset2))
Now, I want to figure out which customers ordered Brand X on their first OrderDate, and create a new dataset (BigSubset.df) that contains all of the OrderDates for the customers that purchased Brand X on their first order date.
Should look something like this:
Customer BrandName OrderDate DateRank
10071535 X 4/27/2012 0:00 18.5
10071535 Y 4/27/2012 0:00 18.5
10071535 Y 5/16/2012 0:00 20
10071535 Y 5/30/2012 0:00 21
10071535 Z 6/12/2012 0:00 24
10071535 Z 7/3/2012 0:00 25
10072059 X 11/23/2011 0:00 11
10072059 X 12/16/2011 0:00 13
10072059 X 3/21/2012 0:00 16
10072113 X 1/4/2010 0:00 7.5
10072113 Y 1/4/2010 0:00 7.5
10072113 Y 1/19/2010 0:00 3
10072113 Z 1/19/2010 0:00 3
10072113 Z 1/19/2010 0:00 3
10072113 Z 1/25/2010 0:00 5.5
10072113 Z 1/25/2010 0:00 5.5
I can't seem to get R to reference the smaller dataset when I attempt to create BigSubset.df from Purchase.df because the number of rows are not equal. I've searched on Google and haven't seen any answers, so I'm not even sure if this is possible in R. Let me know what you think.
Maybe I'm misunderstanding, but I believe this works:
Xfirst <- as.vector(subset.df[subset.df$BrandName == "X", ])$Customer
BigSubset.df <- Purchase.df[Purchase.df$Customer %in% Xfirst, ]
I think you may have a bug in your daterank because in your example Customer 10072113 has date 1/19/2010 ranked 3 but the earlier 1/4/2010 ranked 7.5. (Sidenote, in your chunk
function you can use which.min(chunk$DateRank)
instead of which(chunk$DateRank==min(chunk$DateRank))
, which I believe is more efficient.)
这篇关于从子集中选择观察值,以根据R中的大数据帧创建新的子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!