Header names as dates in R
Question
I'm trying to calculate the "death" of users, meaning I want to determine the time between when a user signs up for a program and when they are no longer active in it. I have two files, which I read in using read.csv("filename", header=TRUE). File 1:
> df
name start.date
1 Allison 2013-03-16
2 Andrew 2013-03-16
3 Carl 2013-03-16
4 Dora 2013-03-17
5 Hilary 2013-03-17
6 Louis 2013-03-19
7 Mary 2013-03-20
8 Mickey 2013-03-20
and file 2:
> df2
names X04.16.2013 X04.17.2013 X04.18.2014 X04.19.2013
2001 Allison 5 5 0 0
2002 Andrew 0 0 0 0
2003 Carl 8 8 11 10
2004 Dora 6 4 9 3
2005 Hilary 2 0 0 0
2006 Louis 18 10 8 3
2007 Mary 4 7 7 0
2008 Mickey 9 5 0 0
What I would like to do is convert the header names of df2 to dates, so that I can create a new data frame with the user names, start date, and "days to death", where "death" is the first date on which a user has an entry of 0 in df2:
name start.date days.to.death
1 Allison 2013-03-16 33
2 Andrew 2013-03-16 0
3 Carl 2013-03-16 NA
4 Dora 2013-03-17 NA
5 Hilary 2013-03-17 31
6 Louis 2013-03-19 NA
7 Mary 2013-03-20 30
8 Mickey 2013-03-20 28
Note that Andrew was never "alive", and Carl, Dora, and Louis haven't "died" yet. I'm still rather new to R, so any input is much appreciated!
Recommended answer
Assuming there is a typo in your column headers for df2 (X04.18.2014 should presumably be X04.18.2013), the following solution using dplyr and tidyr gets you most of the way there...
library(tidyr)
library(dplyr)
colnames(df) <- c("names", "start")  # rename so the join column matches "names" in df2
df$start <- as.Date(df$start, format = "%Y-%m-%d")  # start dates are printed as YYYY-MM-DD above
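Before reshaping, it can help to check the header conversion on its own. The sketch below uses only base R and a hypothetical vector `hdr` holding the column names as they appear in the question; read.csv prepends the "X" because syntactic names cannot start with a digit, so the format string has to include it.

```r
# Column names as printed for df2 in the question (hdr is an illustrative name)
hdr <- c("X04.16.2013", "X04.17.2013", "X04.18.2014", "X04.19.2013")

# Parse the headers, treating the leading "X" as a literal character
dates <- as.Date(hdr, format = "X%m.%d.%Y")
print(dates)

# Gaps between consecutive headers; a negative gap means the columns are
# not in chronological order, which points at the 2014 typo
print(diff(dates))
```

If the columns really are out of order, the "first zero" logic further down would pick the wrong date, so this check is worth running before the pipeline.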
The following works on df2 by reshaping the data to long format, converting the header names to dates (similar to MrFlick's suggestion), and then keeping only the rows with a value of 0. It then takes the first such row per user (i.e. the earliest date, assuming your date columns are in chronological order from left to right). Finally, it calculates the difference between that date (the end date) and the start date from df. I've used the same format as MrFlick, but you could simply calculate the difference as an integer.
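df2 %>%
  filter(X04.16.2013 != 0) %>%   # removes Andrew, who has 0 in the first date column
  gather(key, value, 2:5) %>%    # long format: one row per user and date column
  mutate(date = as.Date(key, format = "X%m.%d.%Y")) %>%   # header names to dates
  left_join(df, by = "names") %>%   # bring in each user's start date
  filter(value == 0) %>%         # keep only dates with no activity
  group_by(names) %>%
  filter(date == nth(date, 1)) %>%   # first zero date per user
  select(names, start, date) %>%
  mutate(daydiff = difftime(date, start, units = "days"))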
which gives this...
names start date daydiff
1 Hilary 2013-03-17 2013-04-17 31 days
2 Allison 2013-03-16 2013-04-18 33 days
3 Mickey 2013-03-20 2013-04-18 29 days
4 Mary 2013-03-20 2013-04-19 30 days
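The daydiff column above is a difftime, which is why it prints with "days". As mentioned before the pipeline, the difference can also be kept as a plain number. A minimal base R sketch, using Hilary's two dates from the output and illustrative variable names:

```r
start <- as.Date("2013-03-17")
end   <- as.Date("2013-04-17")

# difftime keeps the unit attached to the value
d <- difftime(end, start, units = "days")
print(d)

# as.numeric() drops the unit and leaves a bare number of days
n <- as.numeric(d, units = "days")
print(n)
```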
It should be pretty easy to put in the NAs and the users who never lived. Perhaps this helps a little?
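One possible way to fill in those remaining cases, sketched in base R rather than dplyr. This is not part of the answer's pipeline: the data is retyped from the question with the X04.18.2014 header assumed to be 2013, and `users`, `counts`, and `first_zero` are illustrative names.

```r
# Data rebuilt from the question (third header assumed to be a typo for 2013)
users <- c("Allison", "Andrew", "Carl", "Dora", "Hilary", "Louis", "Mary", "Mickey")
df <- data.frame(
  name = users,
  start.date = as.Date(c("2013-03-16", "2013-03-16", "2013-03-16", "2013-03-17",
                         "2013-03-17", "2013-03-19", "2013-03-20", "2013-03-20"))
)
counts <- matrix(
  c( 5,  5,  0,  0,
     0,  0,  0,  0,
     8,  8, 11, 10,
     6,  4,  9,  3,
     2,  0,  0,  0,
    18, 10,  8,  3,
     4,  7,  7,  0,
     9,  5,  0,  0),
  ncol = 4, byrow = TRUE,
  dimnames = list(users, c("X04.16.2013", "X04.17.2013", "X04.18.2013", "X04.19.2013"))
)
dates <- as.Date(colnames(counts), format = "X%m.%d.%Y")

# Column index of the first zero in each row; NA when a user never hits zero
first_zero <- apply(counts, 1, function(x) which(x == 0)[1])

# Days from start date to the first zero date; NA carries through for live users
df$days.to.death <- as.integer(dates[first_zero] - df$start.date)

# A zero in the first date column means the user was never "alive"
df$days.to.death[counts[, 1] == 0] <- 0L
print(df)
```

With this data Mickey comes out at 29 days, which matches the pipeline output above rather than the 28 given in the question's expected table.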