R dates as column names containing duplicate values (need to retain original date)
Problem description
I have a dataset that I'm trying to tidy. I read the file in with read.xlsx. The header contains date values, and I need to retain those values, even when they are duplicated, when I gather/spread the data.
The dataset looks like the one below. The dates from Excel are read in as numbers (which is fine). The issue is that there can be duplicate dates (e.g. 43693), and I need to keep their original values.
      Date 43693 43686 43686 43714 43693
1 Contract   111   222   333   444   555
2     Org1    NR    NB    NR    NB     P
3     Org2     P     P     P    NB    NR
4     Org3    NB    NB    NB    NB     P
When I try to transform the data, I get an error about duplicate names.
Ultimately, I'm trying to get my data shaped like this, where the date value is retained for every duplicate (e.g. 43693):
   Date Contract  ORG status
1 43693      111 Org1     NR
2 43693      555 Org1      P
3 43686      111 Org2      P
Here is an example df to test on:
df <- structure(
  list(
    Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
    '12/16/18' = c("111", "pending", "complete", "complete", "pending"),
    '12/16/18' = c("222", "pending", "complete", "pending", "complete"),
    '1/18/18' = c("222", "pending", "complete", "pending", "complete")
  ),
  class = "data.frame",
  .Names = c("Date", "12/16/18", "12/16/18", "1/18/18"),
  row.names = c(NA, -5L)
)
Recommended answer
You have two header rows, which is pretty messy. I'd recommend re-reading the data, skipping the date line, and then incorporating the date line into the column names.
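A minimal base R sketch of that idea, using the numeric-date layout from the question. Here `read.table(text = ...)` is only a stand-in for the Excel read, and the names `raw`, `dates`, `contracts` and `dat` are illustrative. With a real workbook you would read the sheet without treating the first row as a header (for example, `openxlsx::read.xlsx` has a `colNames` argument for this) and then apply the same steps.

```r
# Stand-in for the spreadsheet: same layout as plain text, read with no header
# so the date row and the contract row both arrive as ordinary data rows.
raw <- read.table(
  text = "Date,43693,43686,43686,43714,43693
Contract,111,222,333,444,555
Org1,NR,NB,NR,NB,P
Org2,P,P,P,NB,NR
Org3,NB,NB,NB,NB,P",
  sep = ",", header = FALSE, colClasses = "character"
)

# Pull the two header rows out as plain character vectors
dates     <- unlist(raw[1, -1], use.names = FALSE)
contracts <- unlist(raw[2, -1], use.names = FALSE)

# Keep only the body and build unique "<date>_<contract>" column names
dat <- raw[-(1:2), ]
names(dat) <- c("Org", paste(dates, contracts, sep = "_"))
rownames(dat) <- NULL

names(dat)
```

Because each name combines the date with its contract number, the repeated dates no longer collide, and the original date can be split back out later.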
If you already have the data read in, you can try something like this:
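library(data.table)
# Drop the first row (the contract numbers) and convert to a data.table
df2 <- setDT(df[-1, ])
# Rename the columns to "<date>_<contract>" so that every name is unique
setnames(df2, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1]))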
# Current data
df2
# Org 12/16/18_111 12/16/18_222 1/18/18_222
# 1: Org1 pending pending pending
# 2: Org2 complete complete complete
# 3: Org3 complete pending pending
# 4: Org4 pending complete complete
# melt and split
melt(df2, id.vars="Org")[, c("Date", "Contract") := tstrsplit(variable, "_")][, variable := NULL][]
# Org value Date Contract
# 1: Org1 pending 12/16/18 111
# 2: Org2 complete 12/16/18 111
# 3: Org3 complete 12/16/18 111
# 4: Org4 pending 12/16/18 111
# 5: Org1 pending 12/16/18 222
# 6: Org2 complete 12/16/18 222
# 7: Org3 pending 12/16/18 222
# 8: Org4 complete 12/16/18 222
# 9: Org1 pending 1/18/18 222
# 10: Org2 complete 1/18/18 222
# 11: Org3 pending 1/18/18 222
# 12: Org4 complete 1/18/18 222
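For comparison, the same reshape can be sketched in base R, with the columns arranged in the order the question asks for (Date, Contract, ORG, status). This is not the answer's method, just an equivalent under the assumption that the first data row always holds the contract numbers. The names `dates`, `contracts`, `orgs` and `long` are illustrative.

```r
# Example data from the question, with the duplicated date header
df <- structure(
  list(
    Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
    `12/16/18` = c("111", "pending", "complete", "complete", "pending"),
    `12/16/18` = c("222", "pending", "complete", "pending", "complete"),
    `1/18/18` = c("222", "pending", "complete", "pending", "complete")
  ),
  class = "data.frame",
  row.names = c(NA, -5L)
)

dates     <- names(df)[-1]                            # original dates, duplicates kept
contracts <- unlist(df[1, ], use.names = FALSE)[-1]   # first row holds the contracts
orgs      <- df$Date[-1]                              # remaining rows are organisations

# unlist() walks the status cells column by column, so dates and contracts
# are repeated once per organisation to line up with it.
long <- data.frame(
  Date     = rep(dates, each = length(orgs)),
  Contract = rep(contracts, each = length(orgs)),
  ORG      = rep(orgs, times = length(dates)),
  status   = unlist(df[-1, -1], use.names = FALSE),
  stringsAsFactors = FALSE
)

long
```

Since the dates are taken from `names(df)` before any reshaping, every row keeps its original date even when that date appears in more than one column.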
If you do want to stick with dplyr and tidyr, here's a translation of the above:
library(dplyr)
library(tidyr)
setNames(df, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1])) %>%
  slice(-1) %>%
  pivot_longer(-Org) %>%
  separate(name, into = c("Date", "Contract"), sep = "_")
Note that you have to rename the dataset's columns before you start chaining the other commands together, since the duplicate names are what trigger the error.
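One way to make that ordering explicit is to check for duplicate names before the pipeline starts. The sketch below uses only base R and a cut-down copy of the example data. The names `new_names` and `renamed` are illustrative, and the `stopifnot` guard is an addition for this example, not part of the answer.

```r
# Cut-down copy of the example data, with the duplicated date header
df <- setNames(
  data.frame(
    c("Contract", "Org1", "Org2"),
    c("111", "pending", "complete"),
    c("222", "pending", "complete"),
    c("222", "pending", "complete"),
    stringsAsFactors = FALSE
  ),
  c("Date", "12/16/18", "12/16/18", "1/18/18")
)

# Non-zero here: at least one column name is repeated
anyDuplicated(names(df))

# Same renaming as in the answer: "<date>_<contract>" for every data column
new_names <- c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1])
renamed <- setNames(df, new_names)

# Guard: only continue to slice()/pivot_longer() once the names are unique
stopifnot(anyDuplicated(names(renamed)) == 0)
names(renamed)
```

If two columns ever shared both a date and a contract number, the guard would stop the script, and the pasted names would need a further distinguishing piece such as the column position.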