Identify the dates on which a value changes and summarize the data with sum() and diff() in R


Problem description

Sample data:

product_id <- c("1000","1000","1000","1000","1000","1000", "1002","1002","1002","1002","1002","1002")
qty_ordered <- c(1,2,1,1,1,1, 1,2,1,2,1,1)
price <- c(2.49,2.49,2.49,1.743,2.49,2.49, 2.093,2.093,2.11,2.11,2.11,2.97)
date <- c("2/23/15","2/23/15","3/16/15","3/16/15","5/16/15","6/18/15", "2/19/15","3/19/15","3/19/15","3/19/15","3/19/15","4/19/15")
sampleData <- data.frame(product_id, qty_ordered, price, date)

I would like to identify every point at which the price changed. I would also like to sum() the total qty_ordered between each pair of consecutive price-change dates. For example, for product_id == "1000", the price changed from $2.49 to $1.743 on 3/16/15. The total qty_ordered at the first price is 1+2+1=4, and the difference between the two earliest price-change dates, 2/23/15 and 3/16/15, is 21 days.

The new data frame should therefore be:

product_id  sum_qty_ordered  price  date_diff
1000        4                2.490  21
1000        1                1.743  61
1000        2                2.490  33

Here is what I have tried:

**NOTE: in this case a simple "dplyr::group_by" will not work, because it ignores the date effect.

1) I found this code in "Determine when columns of a data.frame change value and return indices of the change". It is meant to identify every row where the price changes, that is, the first date on which each new price appears for a product.

IndexedChanged <- c(1,which(rowSums(sapply(sampleData[,3],diff))!=0)+1)
sampleData[IndexedChanged,]
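
As a rough base R sketch of the same idea (not the code from the linked question), the change rows can be found by comparing each row with the one above it. The names `new_price`, `new_product` and `change_rows` are made up for this illustration, and the sketch assumes the rows are already ordered by product and date, as they are in the sample data.

```r
# Sample columns from the question
product_id <- c(rep("1000", 6), rep("1002", 6))
price <- c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49,
           2.093, 2.093, 2.11, 2.11, 2.11, 2.97)

n <- length(price)

# A row starts a new run if its price or its product differs from the row above
new_price   <- diff(price) != 0
new_product <- product_id[-1] != product_id[-n]

# The first row always starts a run
change_rows <- which(c(TRUE, new_price | new_product))
print(change_rows)
```

These indices only mark where each price run starts; the sums and date differences still have to be computed per run.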

However, if I use that code, I am not sure how to calculate sum(qty_ordered) and the date difference for each of those entries.

2) I tried to write a WHILE loop that temporarily stores each batch of product_id, price and date range (i.e. a subset of the data frame with one product_id, one price, and all entries from the earliest date of that price until the last date before it changed), and then summarises that subset to get sum(qty_ordered) and the date difference. However, I always get confused by WHILE and FOR, so my code has some problems. Here is my code:

Create an empty data frame for later data storage:

 NewData_Ready <- data.frame(
                     product_id = character(),
                     price = double(),
                     early_date = as.Date(character()),
                     last_date=as.Date(character()),
                     total_qty_demanded = double(),                          
                     stringsAsFactors=FALSE) 

Create a temp table to store the batch of order entries for one price:

 temp_dataset <- data.frame(
                     product_id = character(),
                     qty_ordered = double(),
                     price = double(),
                     date=as.Date(character()),                                  
                     stringsAsFactors=FALSE) 

Loop: this is messy and probably does not make sense, so I really need help with it.

for ( i in unique(sampleData$product_id)){
    #for each unique product_id in the dataset, we are gonna loop through it based on product_id
    #for first product_id which is "1000"
    temp_table <- sampleData[sampleData$product_id == "i", ] #subset dataset by ONE single product_id
    #this dataset only has product of "1000" entries

    #starting a new for loop to loop through the entire entries for this product
    for ( p in 1:length(temp_table$product_id)){

        current_price <- temp_table$price[p] #assign current_price to the first price value
        #assign $2.49 to current price. 
        min_date <- temp_table$date[p] #assign the first date when the first price change
        #assign 2015-2-23 to min_date which is the earliest date when price is $2.49

        while (current_price == temp_table$price[p+1]){
        #while the next price is the same as the first price 
        #that is, if the second price is $2.49 is the same as the first price of $2.49, which is TRUE
        #then execute the following statement

            temp_dataset <- rbind(temp_dataset, temp_table[p,])
            #if the WHILE loop is TRUE, means every 2 entries have the same price
            #then combine each entry when price is the same in temp_table with the temp_dataset

            #if the WHILE loop is FALSE, means one entry's price is different from the next one
            #then stop the statement at the above, but do the following
            current_price <- temp_table$price[p+1]
            #this will reassign the current_price to the next price, and restart the WHILE loop

            by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
            NewRow <- dplyr::summarise(
                                early_date = min(date),
                                last_date = max(date),
                                total_qty_demanded = sum(qty_ordered))
            NewData_Ready <- rbind(NewData_Ready, NewRow)



        }
    }

}

I have searched a lot of related questions but have not found anything that addresses this problem yet. If you have any suggestions for a solution, please let me know. I would greatly appreciate your time and help!

Here is my R version:
platform       x86_64-apple-darwin13.4.0   
arch           x86_64                      
os             darwin13.4.0                
system         x86_64, darwin13.4.0        
status                                     
major          3                           
minor          3.1                         
year           2016                        
month          06                          
day            21                          
svn rev        70800                       
language       R                           
version.string R version 3.3.1 (2016-06-21)
nickname       Bug in Your Hair      

Recommended answer

Using data.table:

library(data.table)
setDT(sampleData)

Some preprocessing:

sampleData[, firstdate := as.Date(date, "%m/%d/%y")]

Based on how you calculate the date difference, we are better off creating a date range for each row, running from that row's date to the date of the next row for the same product:

sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
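
To make the "lead" step concrete, here is a small base R sketch of what those two lines produce for product "1000". It does not use data.table; the variable `n` and the manual shifting are only for illustration.

```r
# Dates for product "1000" from the sample data
firstdate <- as.Date(c("2/23/15", "2/23/15", "3/16/15",
                       "3/16/15", "5/16/15", "6/18/15"), "%m/%d/%y")
n <- length(firstdate)

# Each row's lastdate is the next row's firstdate;
# the final row has no successor, so it falls back to its own date
lastdate <- c(firstdate[-1], firstdate[n])

print(data.frame(firstdate, lastdate))
```

The row at 1.743 (row 4) therefore gets the range 2015-03-16 to 2015-05-16, which is where the 61 days in the result come from.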

Then create a new ID for every change in price:

sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
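
The cumsum/diff idiom can be checked on a single vector. The sketch below applies it to the prices of product "1000" in plain R; `changed` is just an illustrative intermediate name.

```r
# Prices for product "1000" from the sample data
price <- c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49)

# TRUE wherever a row's price differs from the previous row's price
changed <- diff(price) != 0

# Prepend 0 for the first row, then accumulate: every change bumps the run ID
price_id <- cumsum(c(0, changed))
print(price_id)
# [1] 0 0 0 1 2 2
```

Note that the two 2.49 runs get different IDs (0 and 2), which is exactly what grouping by price alone would miss.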

Then calculate your groupwise functions, by product and price run:

sampleData[,
           .(
             price = unique(price),
             sum_qty = sum(qty_ordered),
             date_diff = max(lastdate) - min(firstdate)
           ),
           by = .(
             product_id,
             price_id
           )
           ]

   product_id price_id price sum_qty date_diff
1:       1000        0 2.490       4   21 days
2:       1000        1 1.743       1   61 days
3:       1000        2 2.490       2   33 days
4:       1002        0 2.093       3   28 days
5:       1002        1 2.110       4   31 days
6:       1002        2 2.970       1    0 days
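
For readers without data.table, the same steps can be sketched in base R as a cross-check. This is an assumed equivalent, not part of the answer above; the helper names `per_product` and `result` are invented here, and the sketch relies on the rows already being sorted by product and date.

```r
sampleData <- data.frame(
  product_id  = c(rep("1000", 6), rep("1002", 6)),
  qty_ordered = c(1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1),
  price       = c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49,
                  2.093, 2.093, 2.11, 2.11, 2.11, 2.97),
  date        = c("2/23/15", "2/23/15", "3/16/15", "3/16/15", "5/16/15", "6/18/15",
                  "2/19/15", "3/19/15", "3/19/15", "3/19/15", "3/19/15", "4/19/15"),
  stringsAsFactors = FALSE
)
sampleData$firstdate <- as.Date(sampleData$date, "%m/%d/%y")

# Handle one product at a time
per_product <- lapply(split(sampleData, sampleData$product_id), function(d) {
  n <- nrow(d)
  # Lead date within the product; the last row falls back to its own date
  d$lastdate <- c(d$firstdate[-1], d$firstdate[n])
  # Run ID that increases at every price change
  d$price_id <- cumsum(c(0, diff(d$price) != 0))
  # Summarise each price run into one row
  do.call(rbind, lapply(split(d, d$price_id), function(r) {
    data.frame(
      product_id = r$product_id[1],
      price_id   = r$price_id[1],
      price      = r$price[1],
      sum_qty    = sum(r$qty_ordered),
      date_diff  = as.numeric(max(r$lastdate) - min(r$firstdate)),
      stringsAsFactors = FALSE
    )
  }))
})

result <- do.call(rbind, per_product)
rownames(result) <- NULL
print(result)
```

Here date_diff is a plain number of days rather than a difftime, which makes it easier to compare against the expected table.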

I think the last price change for 1000 is only 33 days, and the preceding one is 61 (not 60). If you include the first day, the values are 22, 62 and 34, and the line should read date_diff = max(lastdate) - min(firstdate) + 1
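
The two counting conventions can be compared directly for the three price runs of product "1000". In this sketch `run_start` and `run_end` are illustrative names holding the first and last date of each run.

```r
# Start and end date of each price run for product "1000"
run_start <- as.Date(c("2/23/15", "3/16/15", "5/16/15"), "%m/%d/%y")
run_end   <- as.Date(c("3/16/15", "5/16/15", "6/18/15"), "%m/%d/%y")

# Plain difference: the start day itself is not counted
exclusive <- as.numeric(run_end - run_start)

# Counting both the start day and the end day
inclusive <- exclusive + 1

print(rbind(exclusive, inclusive))
```

Which convention is right depends on whether a run should be measured as elapsed days or as the number of calendar days it covers.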
