How to subset data frame by date and perform multiple operations in R?
Problem description
I receive daily CSV reports, and each has the same number of variables but comes from a different date. I want to run some simple analysis based on date and save the results. I think a for loop can do the job, but I only know the basics. Ideally, I would only need to run the script once a month to get the results. Any guidance or advice is appreciated.
Let's say I have two CSV reports in a folder:
#File 1 - 20200624.csv
Date Market Salesman Product Quantity Price Cost
6/24/2020 A MF Apple 20 1 0.5
6/24/2020 A RP Apple 15 1 0.5
6/24/2020 A RP Banana 20 2 0.5
6/24/2020 A FR Orange 20 3 0.5
6/24/2020 B MF Apple 20 1 0.5
6/24/2020 B RP Banana 20 2 0.5
#File 2 - 20200625.csv
Date Market Salesman Product Quantity Price Cost
6/25/2020 A MF Apple 10 1 0.6
6/25/2020 A MF Banana 15 1 0.6
6/25/2020 A RP Banana 10 2 0.6
6/25/2020 A FR Orange 15 3 0.6
6/25/2020 B MF Apple 20 1 0.6
6/25/2020 B RP Banana 20 2 0.6
I imported all the files into R using the following code:
library(readr)
library(dplyr)
#Import files
files <- list.files(path = "~/JuneReports",
                    pattern = "\\.csv$", full.names = TRUE)
tbl <- sapply(files, read_csv, simplify=FALSE) %>%
bind_rows(.id = "id")
#Remove the "id" column
tbl2 <- tbl[,-1]
#Subset the data frame to get only Market A, as Market B is irrelevant.
tbl3 <- subset(tbl2, Market == "A")
head(tbl3)
# A tibble: 6 x 7
Date Market Salesman Product Quantity Price Cost
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 6/24/2020 A MF Apple 20 1 0.5
2 6/24/2020 A RP Apple 15 1 0.5
3 6/24/2020 A RP Banana 20 2 0.5
4 6/24/2020 A FR Orange 20 3 0.5
5 6/25/2020 A MF Apple 10 1 0.6
6 6/25/2020 A MF Banana 15 1 0.6
Below are the results I want to get:
Date       Market  Revenue  Total Cost  Apples Sold  Bananas Sold  Oranges Sold
6/24/2020  A       135      37.5        35           20            20
6/25/2020  A       90       30          10           25            15
#Revenue = sumproduct(Quantity, Price)
#Total Cost = sumproduct(Quantity, Cost)
#Apples/Bananas/Oranges Sold = sum of Quantity where Product == "Apple"/"Banana"/"Orange"
Recommended answer
We group by 'Date' and 'Market', calculate the sum of the products of 'Quantity' with 'Price' and with 'Cost', add those two results to the grouping together with 'Product' in a second group_by (using .add = TRUE), get the sum of 'Quantity', and use pivot_wider to reshape the result into 'wide' format.
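As a quick check of the idiom this answer relies on: the matrix product of two numeric vectors of equal length is their sum of products, which is what a spreadsheet SUMPRODUCT returns. The sketch below uses the four Market A rows dated 6/24/2020 from the question; the variable names are only for illustration.

```r
# Quantity and Price for the four Market A rows dated 6/24/2020
quantity <- c(20, 15, 20, 20)
price    <- c(1, 1, 2, 3)

# %*% returns a 1 x 1 matrix; c() drops it to a plain length-one vector
revenue_matrix <- c(quantity %*% price)

# The same value written as an element-wise product followed by sum()
revenue_sum <- sum(quantity * price)

print(revenue_matrix)  # 135
print(revenue_sum)     # 135
```

Either form can be used for the Revenue and Total Cost columns; sum(Quantity * Price) may read more naturally if you are not used to matrix operators.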
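library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
  # first grouping: one group per day and market
  group_by(Date, Market) %>%
  # add the sums of products as extra grouping columns, along with Product
  group_by(Revenue = c(Quantity %*% Price),
           TotalCost = c(Quantity %*% Cost),
           Product, .add = TRUE) %>%
  # units sold per product within each day and market
  summarise(Sold = sum(Quantity)) %>%
  # one column per product
  pivot_wider(names_from = Product, values_from = Sold)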
# A tibble: 2 x 7
# Groups: Date, Market, Revenue, TotalCost [2]
# Date Market Revenue TotalCost Apple Banana Orange
# <chr> <chr> <dbl> <dbl> <int> <int> <int>
#1 6/24/2020 A 135 37.5 35 20 20
#2 6/25/2020 A 25 15 10 15 NA
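The code above was written for dplyr 1.0.0 and computes Revenue and TotalCost inside group_by(). As far as I know, the group_by() documentation in later dplyr releases says that such computations are done on the ungrouped data frame, so the per-day totals may differ on other versions. The following is a minimal sketch of an alternative that does not depend on that behaviour: it computes the totals and the per-product quantities in two separate summarise() steps and joins them. The names sales, totals, sold and daily are made up for this example, and the data repeats the six rows of df1.

```r
library(dplyr)
library(tidyr)

# Same six rows as the df1 used in this answer (Salesman omitted, it is not needed)
sales <- data.frame(
  Date     = c("6/24/2020", "6/24/2020", "6/24/2020",
               "6/24/2020", "6/25/2020", "6/25/2020"),
  Market   = "A",
  Product  = c("Apple", "Apple", "Banana", "Orange", "Apple", "Banana"),
  Quantity = c(20L, 15L, 20L, 20L, 10L, 15L),
  Price    = c(1, 1, 2, 3, 1, 1),
  Cost     = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6)
)

# Revenue and total cost: one row per Date and Market
totals <- sales %>%
  group_by(Date, Market) %>%
  summarise(Revenue   = sum(Quantity * Price),
            TotalCost = sum(Quantity * Cost),
            .groups = "drop")

# Units sold per product, reshaped to one column per product
sold <- sales %>%
  group_by(Date, Market, Product) %>%
  summarise(Sold = sum(Quantity), .groups = "drop") %>%
  pivot_wider(names_from = Product, values_from = Sold)

# Put both pieces side by side
daily <- left_join(totals, sold, by = c("Date", "Market"))
print(daily)
```

On these six rows the result should match the table above, including the NA for Orange on 6/25/2020, because no Orange row for that date is present in df1.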
Data
df1 <- structure(list(Date = c("6/24/2020", "6/24/2020", "6/24/2020",
"6/24/2020", "6/25/2020", "6/25/2020"), Market = c("A", "A",
"A", "A", "A", "A"), Salesman = c("MF", "RP", "RP", "FR", "MF",
"MF"), Product = c("Apple", "Apple", "Banana", "Orange", "Apple",
"Banana"), Quantity = c(20L, 15L, 20L, 20L, 10L, 15L), Price = c(1L,
1L, 2L, 3L, 1L, 1L), Cost = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
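For the once-a-month workflow described in the question, the steps can be wrapped in a single function so that no explicit for loop is needed. This is only a sketch under several assumptions: the function name summarise_reports is hypothetical, the files are assumed to be comma-separated with the column names shown in the question, and the output path in the commented example is made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: reads every CSV in `folder`, keeps one market,
# and returns one summary row per date.
summarise_reports <- function(folder, market = "A") {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

  # Read each file and stack the results into one data frame
  reports <- bind_rows(lapply(files, read.csv, stringsAsFactors = FALSE))

  # Keep the requested market and turn the m/d/Y text into a Date
  reports <- reports %>%
    filter(Market == market) %>%
    mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

  # Revenue and total cost per day
  totals <- reports %>%
    group_by(Date, Market) %>%
    summarise(Revenue   = sum(Quantity * Price),
              TotalCost = sum(Quantity * Cost),
              .groups = "drop")

  # Units sold per product, one column per product
  sold <- reports %>%
    group_by(Date, Market, Product) %>%
    summarise(Sold = sum(Quantity), .groups = "drop") %>%
    pivot_wider(names_from = Product, values_from = Sold)

  left_join(totals, sold, by = c("Date", "Market"))
}

# Example call (both paths are assumptions):
# june <- summarise_reports("~/JuneReports")
# write.csv(june, "~/JuneReports_summary.csv", row.names = FALSE)
```

Writing the summary outside the report folder keeps it from being picked up as an input file on the next run. Because Date is converted with as.Date(), the result can also be restricted to a date range with an ordinary comparison such as Date >= as.Date("2020-06-01").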