Select observations from a subset to create a new subset based on a large dataframe in R


Question

I have a dataset (Purchase.df) with many columns and rows. The variables that matter for this question are "Customer", "OrderDate", "DateRank" (which ranks the dates so I can find the earliest one) and "BrandName". Below is a very small sample of what I'm working with (I'm new to this website, so I hope what I paste below works):

Purchase.df<-structure(list(Customer = c(10071535L, 10071535L, 10071535L, 
10071535L, 10071535L, 10071535L, 10071711L, 10071711L, 10071711L, 
10071711L, 10071711L, 10071711L, 10071711L, 10071711L, 10071711L, 
10071711L, 10071711L, 10071711L, 10072059L, 10072059L, 10072059L, 
10072113L, 10072113L, 10072113L, 10072113L, 10072113L, 10072113L, 
10072113L), BrandName = structure(c(1L, 2L, 2L, 2L, 3L, 3L, 2L, 
2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 3L, 3L, 3L, 3L), .Label = c("X", "Y", "Z"), class = "factor"), 
OrderDate = structure(c(14L, 14L, 15L, 16L, 19L, 20L, 11L, 
18L, 5L, 6L, 1L, 17L, 21L, 22L, 23L, 8L, 10L, 13L, 7L, 9L, 
12L, 4L, 4L, 2L, 2L, 2L, 3L, 3L), .Label = c("1/17/2011 0:00", 
"1/19/2010 0:00", "1/25/2010 0:00", "1/4/2010 0:00", "10/22/2010 0:00", 
"11/15/2010 0:00", "11/23/2011 0:00", "12/14/2011 0:00", 
"12/16/2011 0:00", "2/7/2012 0:00", "3/16/2010 0:00", "3/21/2012 0:00", 
"4/16/2012 0:00", "4/27/2012 0:00", "5/16/2012 0:00", "5/30/2012 0:00", 
"5/5/2011 0:00", "6/1/2010 0:00", "6/12/2012 0:00", "7/3/2012 0:00", 
"8/1/2011 0:00", "8/16/2011 0:00", "9/19/2011 0:00"), class = "factor"), 
DateRank = c(18.5, 18.5, 20, 21, 24, 25, 15, 23, 9, 10, 1, 
22, 26, 27, 28, 12, 14, 17, 11, 13, 16, 7.5, 7.5, 3, 3, 3, 
5.5, 5.5)), .Names = c("Customer", "BrandName", "OrderDate", 
"DateRank"), row.names = c(NA, -28L), class = "data.frame")

I've created a subset of this large dataset (subset.df) that finds the first OrderDate for each customer and tells me which brand they purchased. I used the following code to do this:

subset1 <- split(Purchase.df, Purchase.df$Customer)
subset2 <- lapply(subset1, function(chunk) chunk[which(chunk$DateRank == min(chunk$DateRank)), ])
subset.df <- do.call(rbind, subset2)

Now I want to figure out which customers ordered Brand X on their first OrderDate, and create a new dataset (BigSubset.df) that contains all of the OrderDates for those customers.

It should look like this:

Customer    BrandName   OrderDate   DateRank
10071535    X   4/27/2012 0:00  18.5
10071535    Y   4/27/2012 0:00  18.5
10071535    Y   5/16/2012 0:00  20
10071535    Y   5/30/2012 0:00  21
10071535    Z   6/12/2012 0:00  24
10071535    Z   7/3/2012 0:00   25
10072059    X   11/23/2011 0:00 11
10072059    X   12/16/2011 0:00 13
10072059    X   3/21/2012 0:00  16
10072113    X   1/4/2010 0:00   7.5
10072113    Y   1/4/2010 0:00   7.5
10072113    Y   1/19/2010 0:00  3
10072113    Z   1/19/2010 0:00  3
10072113    Z   1/19/2010 0:00  3
10072113    Z   1/25/2010 0:00  5.5
10072113    Z   1/25/2010 0:00  5.5

I can't seem to get R to reference the smaller dataset when I try to create BigSubset.df from Purchase.df, because the two data frames do not have the same number of rows. I've searched on Google and haven't found any answers, so I'm not even sure whether this is possible in R. Let me know what you think.

Recommended answer

Maybe I'm misunderstanding, but I believe this works:

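# Customers whose first-order rows (the rows kept in subset.df) include brand X
Xfirst <- subset.df$Customer[subset.df$BrandName == "X"]
# All rows of the full data for those customers
BigSubset.df <- Purchase.df[Purchase.df$Customer %in% Xfirst, ]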
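
The same steps can be tried end to end on a small scale. The following is a minimal sketch, not the asker's actual data: the data frame `toy` and all of its values are hypothetical, and only the column names mirror Purchase.df. It shows why the differing row counts are not a problem: `%in%` tests each row of the large data frame against a vector of customer IDs, so the two objects never need to line up row by row.

```r
# Hypothetical toy data; only the column names mirror Purchase.df.
toy <- data.frame(
  Customer  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  BrandName = c("X", "Y", "Z", "Y", "X", "X", "X"),
  DateRank  = c(1, 1, 4, 2, 5, 3, 6),
  stringsAsFactors = FALSE
)

# Step 1: keep each customer's earliest row(s); ties on DateRank are all kept.
first_rows <- do.call(
  rbind,
  lapply(split(toy, toy$Customer),
         function(chunk) chunk[chunk$DateRank == min(chunk$DateRank), ])
)

# Step 2: customers who bought brand X on their first order date.
x_first <- unique(first_rows$Customer[first_rows$BrandName == "X"])

# Step 3: pull every row for those customers from the full data.
big_subset <- toy[toy$Customer %in% x_first, ]
print(big_subset)
```

In this toy data, customer 2 is excluded because their earliest row is brand Y, even though they bought X later.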

I think you may have a bug in your DateRank: in your example, Customer 10072113 has the date 1/19/2010 ranked 3, but the earlier 1/4/2010 ranked 7.5. (Side note: in your chunk function you can use which.min(chunk$DateRank) instead of which(chunk$DateRank==min(chunk$DateRank)), which I believe is more efficient. Be aware that which.min returns only the first minimum, so rows tied for the earliest date would be dropped.)
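
One possible explanation for the ranking issue, offered as a guess rather than a diagnosis: the DateRank values in the sample match what rank() returns on the integer codes of the OrderDate factor, which follow the alphabetical order of the date strings rather than chronological order. The sketch below uses three of the dates from the question to show the difference, and then illustrates how which.min treats ties. The format string "%m/%d/%Y" and the variable names are assumptions based on the sample values.

```r
# Three order dates from the question, stored as a factor like OrderDate.
dates <- factor(c("1/4/2010 0:00", "1/19/2010 0:00", "1/25/2010 0:00"))

# Ranking the factor codes follows the alphabetical order of the strings,
# so "1/19/2010" sorts ahead of "1/4/2010".
code_rank <- rank(as.integer(dates))

# Parsing to Date first gives chronological ranks.
# The format string is an assumption; the trailing " 0:00" is ignored.
parsed <- as.Date(as.character(dates), format = "%m/%d/%Y")
date_rank <- rank(as.numeric(parsed))

print(code_rank)  # 3 1 2
print(date_rank)  # 1 2 3

# Ties: customer 10071535's first three DateRank values from the sample.
ranks <- c(18.5, 18.5, 20)
all_min   <- which(ranks == min(ranks))  # both tied rows: 1 2
first_min <- which.min(ranks)            # only the first tied row: 1
```

If every order on a customer's first date should be kept, the which(... == min(...)) form is the safer choice despite being slightly longer.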
