Are there more elegant ways to transform ragged data into a tidy dataframe


Problem description

I have a dataframe that contains a column of ragged data: "topics" where each topic is a string of characters, and adjacent topics are separated from each other by a delimiter ("|" in this case):

library(lubridate)
events <- data.frame(
  date  =dmy(c(     "12/6/2012",           "13/7/2012",    "4/8/2012")),
  days  =    c(               1,                     6,           0.5),
  name  =    c("Intro to stats", "Stats Winter school", "TidyR tools"),
  topics=    c( "probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors=FALSE
  )

The events dataframe looks like:

        date days                name              topics
1 2012-06-12  1.0      Intro to stats       probability|R
2 2012-07-13  6.0 Stats Winter school R|regression|ggplot
3 2012-08-04  0.5         TidyR tools         tidyR|dplyr

I want to transform this dataframe so that each row contains a single topic, and an indication of how many days were spent on that topic, assuming that if N topics were presented over D days, D/N days were spent on each topic.
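To make the D/N rule concrete, here is a minimal base R sketch that computes only N and D/N for the three example events. The variable names `ntopics` and `days_per_topic` are illustrative choices, not part of the original data.

```r
# D (days) and the delimited topic strings for the three example events
days   <- c(1, 6, 0.5)
topics <- c("probability|R", "R|regression|ggplot", "tidyR|dplyr")

# N: split on the literal "|" and count the pieces per event
ntopics <- lengths(strsplit(topics, "|", fixed = TRUE))

# D / N: days attributed to each topic within an event
days_per_topic <- days / ntopics
days_per_topic
# 0.50 2.00 0.25
```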

I had to do this in a hurry, and did so as follows:

library(dplyr)

events %>%
  # Figure out how many topics were delivered at each event
  mutate(
    ntopics=sapply(
      gregexpr("|", topics, fixed=TRUE),
      function(x)(1 + sum(attr(x, "match.length") > 0 ))
      )
    ) %>%
  # Create a data frame with one topic per row
  do(data.frame(
    date    =rep(   .$date, .$ntopics),
    days    =rep(   .$days, .$ntopics),
    name    =rep(   .$name, .$ntopics),
    ntopics =rep(.$ntopics, .$ntopics),
    topic   =unlist(strsplit(.$topics, "|", fixed=TRUE)),
    stringsAsFactors=FALSE
    )) %>%
  # Estimate roughly how many days were spent on each topic
  mutate(daysPerTopic=days/ntopics)

which gives us

        date days                name ntopics       topic daysPerTopic
1 2012-06-12  1.0      Intro to stats       2 probability         0.50
2 2012-06-12  1.0      Intro to stats       2           R         0.50
3 2012-07-13  6.0 Stats Winter school       3           R         2.00
4 2012-07-13  6.0 Stats Winter school       3  regression         2.00
5 2012-07-13  6.0 Stats Winter school       3      ggplot         2.00
6 2012-08-04  0.5         TidyR tools       2       tidyR         0.25
7 2012-08-04  0.5         TidyR tools       2       dplyr         0.25

I would love to know how to achieve this more elegantly.

Solution

You could try:

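library(data.table)
library(devtools)
source_gist(11380733) ## defines cSplit(); the function is also available in the splitstackshape package

# Split the "topics" column on "|" into one row per topic (long format)
dat <- cSplit(events, "topics", sep="|", "long")

# Per event: count the topics and divide the days evenly, then reorder the columns
dat1 <- dat[, c("ntopics", "daysPerTopic") := {m = length(days); list(m, days/m)},
            by=name][, c(1:3,5,4,6), with=FALSE]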

dat1
#         date days                name ntopics      topics daysPerTopic
# 1: 2012-06-12  1.0      Intro to stats       2 probability         0.50
# 2: 2012-06-12  1.0      Intro to stats       2           R         0.50
# 3: 2012-07-13  6.0 Stats Winter school       3           R         2.00
# 4: 2012-07-13  6.0 Stats Winter school       3  regression         2.00
# 5: 2012-07-13  6.0 Stats Winter school       3      ggplot         2.00
# 6: 2012-08-04  0.5         TidyR tools       2       tidyR         0.25
# 7: 2012-08-04  0.5         TidyR tools       2       dplyr         0.25
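If sourcing the gist is not an option, the same reshaping can be sketched with data.table alone. This is an alternative under stated assumptions, not the answer's method: it swaps `cSplit()` for `strsplit()` inside a grouped `j` expression, and it assumes each combination of `date`, `days` and `name` identifies exactly one event. The object names `dt` and `long` are illustrative.

```r
library(data.table)

# Example data from the question (dates built with as.Date instead of lubridate)
events <- data.frame(
  date   = as.Date(c("2012-06-12", "2012-07-13", "2012-08-04")),
  days   = c(1, 6, 0.5),
  name   = c("Intro to stats", "Stats Winter school", "TidyR tools"),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(events)

# One row per topic: split each event's string on the literal "|"
long <- dt[, .(topic = unlist(strsplit(topics, "|", fixed = TRUE))),
           by = .(date, days, name)]

# Count topics per event, then spread the days evenly across them
long[, ntopics := .N, by = name]
long[, daysPerTopic := days / ntopics]
long
```

Grouping by all three identifying columns carries them through to the long table, so no separate join back to the original data is needed.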

The dplyr approach could be shortened:

library(stringr)
library(dplyr)

res <- events %>%
  # Count the topics per event and remember each row's position
  mutate(ntopics = str_count(topics, pattern = "\\|") + 1,
         N = row_number()) %>%
  # Repeat each row once per topic and attach the split topics
  do(data.frame(
    .[rep(.$N, .$ntopics), ],
    topic = unlist(strsplit(.$topics, "|", fixed = TRUE)))) %>%
  mutate(daysPerTopic = days/ntopics) %>%
  select(-topics, -N)
res
#        date days                name ntopics       topic daysPerTopic
#1 2012-06-12  1.0      Intro to stats       2 probability         0.50
#2 2012-06-12  1.0      Intro to stats       2           R         0.50
#3 2012-07-13  6.0 Stats Winter school       3           R         2.00
#4 2012-07-13  6.0 Stats Winter school       3  regression         2.00
#5 2012-07-13  6.0 Stats Winter school       3      ggplot         2.00
#6 2012-08-04  0.5         TidyR tools       2       tidyR         0.25
#7 2012-08-04  0.5         TidyR tools       2       dplyr         0.25
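Another option, not part of the answer above, is to let tidyr do the row expansion. The sketch below assumes the tidyr package is installed and uses its `separate_rows()` function, whose `sep` argument is a regular expression (hence the escaped `"\\|"`). The topic count is taken before the split so that it still refers to the original delimited string.

```r
library(dplyr)
library(tidyr)

# Example data from the question (dates built with as.Date instead of lubridate)
events <- data.frame(
  date   = as.Date(c("2012-06-12", "2012-07-13", "2012-08-04")),
  days   = c(1, 6, 0.5),
  name   = c("Intro to stats", "Stats Winter school", "TidyR tools"),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

res <- events %>%
  # N: number of topics in each event's delimited string
  mutate(ntopics = lengths(strsplit(topics, "|", fixed = TRUE))) %>%
  # One row per topic; the other columns are repeated automatically
  separate_rows(topics, sep = "\\|") %>%
  rename(topic = topics) %>%
  # D / N: days attributed to each topic
  mutate(daysPerTopic = days / ntopics)
res
```

This keeps the whole transformation in a single pipe and avoids `do()`, at the cost of one extra dependency.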
