Using Countif on Dates in R
Question
**A** | **B** | **C** |**D** |
:----: | :----: | :----:|:----:|
1/1/17 | 3/1/17 |4/1/17 | H |
1/1/17 | 3/1/17 |4/1/17 | H |
2/1/17 | 4/1/17 |5/1/17 | V |
3/1/17 | 5/1/17 |6/1/17 | V |
4/1/17 | 5/1/17 |7/1/17 | H |
4/1/17 | 6/1/17 |7/1/17 | H |
Looking for the result as in the table below using R code
1. A column with Unique list of dates from columns A,B & C above
2. A count of dates <= (less than or equal to) the unique
dates column value in each of the columns A,B & C from above table.
3. Filtered by column D value of 'H' only
Result
**Unique Dates** | **Count of A** | **Count of B** |**Count of C** |
:----: | :----: | :----: | :----: |
1/1/17 | 2 | 0 | 0 |
2/1/17 | 2 | 0 | 0 |
3/1/17 | 2 | 2 | 0 |
4/1/17 | 4 | 2 | 2 |
5/1/17 | 4 | 3 | 2 |
6/1/17 | 4 | 4 | 2 |
7/1/17 | 4 | 0 | 4 |
Recommended answer
At first glance, the question seems to be a simple reshaping task. A closer look shows that the requirements aren't easily implemented if we want to follow the OP's specifications exactly:
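- A column with a unique list of dates from columns A, B & C above
- A count of dates <= (less than or equal to) the unique dates column value in each of the columns A, B & C
- Filtered by column D value of 'H' only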
The `data.table` solution below reshapes the data from wide to long format, does all aggregations (including supplementing missing combinations in the long form by grouping), and finally reshapes the result back to wide format. Additional explanations are given in the comments within the code.
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
setDT(DT)[
# reshape from wide to long format,
# thereby renaming one column as requested
, melt(.SD, id.vars = "D", value.name = "Unique_Dates")][
# convert dates from character to class Date
, Unique_Dates := lubridate::dmy(Unique_Dates)][
# count occurrences by variable & date,
# set key & order by variable & date for subsequent cumsum & join
, .N, keyby = .(D, variable, Unique_Dates)][
# compute cumsum for each variable along unique dates
, N := cumsum(N), by = .(D, variable)][
# join with all possible combinations of D, variables and dates
# use rolling join to fill missing values
CJ(D, variable, Unique_Dates, unique = TRUE), roll = Inf][
# replace remaining NAs
is.na(N), N := 0L][
# finally, reshape selected rows from long to wide
D == "H", dcast(.SD, Unique_Dates ~ paste0("Count_of_", variable))]
Unique_Dates Count_of_A Count_of_B Count_of_C
1: 2017-01-01 2 0 0
2: 2017-01-02 2 0 0
3: 2017-01-03 2 2 0
4: 2017-01-04 4 2 2
5: 2017-01-05 4 3 2
6: 2017-01-06 4 4 2
7: 2017-01-07 4 4 4
- The columns are named according to the OP's expected result.
- The result includes `2017-01-02` as expected, although this date appears only in a row with `D == "V"`, which was supposed to be excluded from the final result.
- A rolling join is used to fill missing values instead of `zoo::na.locf()`.
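As a cross-check of the output above, the counting rule can also be written directly in base R. This is only a minimal sketch and not part of the original answer: it retypes the sample data, assumes the dates are day/month/year (as the `lubridate::dmy()` call implies), and uses illustrative names (`df`, `h_rows`, `counts`, `result`).

```r
# Sample data retyped from the question's table (assumed day/month/year)
df <- data.frame(
  A = c("1/1/17", "1/1/17", "2/1/17", "3/1/17", "4/1/17", "4/1/17"),
  B = c("3/1/17", "3/1/17", "4/1/17", "5/1/17", "5/1/17", "6/1/17"),
  C = c("4/1/17", "4/1/17", "5/1/17", "6/1/17", "7/1/17", "7/1/17"),
  D = c("H", "H", "V", "V", "H", "H"),
  stringsAsFactors = FALSE
)
date_cols <- c("A", "B", "C")
df[date_cols] <- lapply(df[date_cols], as.Date, format = "%d/%m/%y")

# Unique dates come from ALL rows; counts come only from rows with D == "H"
unique_dates <- sort(unique(c(df$A, df$B, df$C)))
h_rows <- df[df$D == "H", date_cols]

# "countif": per column, count the values <= each unique date
counts <- sapply(h_rows, function(col) {
  vapply(seq_along(unique_dates),
         function(i) sum(col <= unique_dates[i]),
         integer(1))
})
colnames(counts) <- paste0("Count_of_", colnames(counts))

result <- data.frame(Unique_Dates = unique_dates, counts)
print(result)
```

For each unique date this repeats a full comparison against every column, so it does more work than the cumulative-sum approach, but it mirrors the spreadsheet-style COUNTIF logic one to one and should reproduce the same counts as the `data.table` result.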
In his question, the OP has provided sample data in a printed format which was difficult to "scrape":
library(data.table)
DT <- fread(
"**A** | **B** | **C** |**D** |
1/1/17 | 3/1/17 |4/1/17 | H |
1/1/17 | 3/1/17 |4/1/17 | H |
2/1/17 | 4/1/17 |5/1/17 | V |
3/1/17 | 5/1/17 |6/1/17 | V |
4/1/17 | 5/1/17 |7/1/17 | H |
4/1/17 | 6/1/17 |7/1/17 | H |",
sep = "|", drop = 5L, stringsAsFactors = TRUE)
# strip the markdown asterisks from the column names
setnames(DT, gsub("*", "", names(DT), fixed = TRUE))
DT
A B C D
1: 1/1/17 3/1/17 4/1/17 H
2: 1/1/17 3/1/17 4/1/17 H
3: 2/1/17 4/1/17 5/1/17 V
4: 3/1/17 5/1/17 6/1/17 V
5: 4/1/17 5/1/17 7/1/17 H
6: 4/1/17 6/1/17 7/1/17 H