Pivoting a CSV file using R


Problem description

      I have a file that looks like this:

                       type          created_at repository_name
      1         IssuesEvent 2012-03-11 06:48:31       bootstrap
      2         IssuesEvent 2012-03-11 06:48:31       bootstrap
      3         IssuesEvent 2012-03-11 06:48:31       bootstrap
      4         IssuesEvent 2012-03-11 06:52:50       bootstrap
      5         IssuesEvent 2012-03-11 06:52:50       bootstrap
      6         IssuesEvent 2012-03-11 06:52:50       bootstrap
      7   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
      8   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
      9   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
      10        IssuesEvent 2012-03-11 07:03:58       bootstrap
      11        IssuesEvent 2012-03-11 07:03:58       bootstrap
      12        IssuesEvent 2012-03-11 07:03:58       bootstrap
      13         WatchEvent 2012-03-11 07:15:44       bootstrap
      14         WatchEvent 2012-03-11 07:15:44       bootstrap
      15         WatchEvent 2012-03-11 07:15:44       bootstrap
      16         WatchEvent 2012-03-11 07:18:45        hogan.js
      17         WatchEvent 2012-03-11 07:18:45        hogan.js
      18         WatchEvent 2012-03-11 07:18:45        hogan.js
      

      The dataset that I'm working with is available at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv.

      I want to create a table that has one column for each distinct entry in the "repository_name" column (e.g. bootstrap, hogan.js). Each of these columns should contain the data from the "type" column that corresponds to that entry (i.e. only rows from the "type" column that also have the value "bootstrap" in the "repository_name" column should fall under the new "bootstrap" column). Hence:

      • Timestamps are only for ordering and do not need to be synchronized across rows (in fact they can be deleted, as the data is already sorted by timestamp)
      • Even if "IssuesEvent" is repeated 10x I need to retain all of these, since I will be doing sequence analysis using the R package TraMineR
      • Columns can be of unequal length
      • There is no relationship between the columns for different repos ("repository_name")

      In other words, I would want a table that looks something like this:

           bootstrap            hogan.js
      1    IssuesEvent          PushEvent
      2    IssuesEvent          IssuesEvent
      3    IssueCommentEvent    WatchEvent
      

      How can I accomplish this in R?

      Some of my failed attempts using the reshape package can be found at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/reshaping_bigqueries.R.

      Solution

      Your sample data:

      data <- structure(list(type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 
      1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("IssueCommentEvent", 
      "IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L, 
      1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
      6L), .Label = c("2012-03-11 06:48:31", "2012-03-11 06:52:50", 
      "2012-03-11 07:03:57", "2012-03-11 07:03:58", "2012-03-11 07:15:44", 
      "2012-03-11 07:18:45"), class = "factor"), repository_name = structure(c(1L, 
      1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
      2L), .Label = c("bootstrap", "hogan.js"), class = "factor")), .Names = c("type", 
      "created_at", "repository_name"), class = "data.frame", row.names = c(NA, 
      -18L))
      

      From your expected output, I gather that you want only one row when the same type shows up multiple times for the same created_at value. In other words, you want to remove duplicate rows:

      data <- unique(data)
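
To make clear what this step removes, here is a minimal sketch on a small made-up data frame (the `events` object and its values are hypothetical, not taken from the dataset). It assumes the default behaviour of `unique()` on a data frame, which drops only rows that are identical in every column, so a type repeated at a different timestamp is kept:

```r
# Hypothetical data frame mirroring the structure of the sample data.
events <- data.frame(
  type = c("IssuesEvent", "IssuesEvent", "IssuesEvent", "WatchEvent"),
  created_at = c("06:48:31", "06:48:31", "06:52:50", "06:52:50"),
  repository_name = c("bootstrap", "bootstrap", "bootstrap", "bootstrap"),
  stringsAsFactors = FALSE
)

# unique() drops rows that are identical in every column;
# here only the second row is an exact copy of an earlier one.
deduped <- unique(events)

nrow(events)   # 4
nrow(deduped)  # 3
deduped$type   # "IssuesEvent" "IssuesEvent" "WatchEvent"
```

If every repeated event should be retained for the sequence analysis, this step can simply be skipped.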
      

      Then, to extract all type entries per repository_name in the order they appear, you can simply use:

      data.split <- split(data$type, data$repository_name)
      data.split
      # $bootstrap
      # [1] IssuesEvent       IssuesEvent       IssueCommentEvent
      # [4] IssuesEvent       WatchEvent       
      # Levels: IssueCommentEvent IssuesEvent WatchEvent
      # 
      # $hogan.js
      # [1] WatchEvent
      # Levels: IssueCommentEvent IssuesEvent WatchEvent
      

      This returns a list, which is the R data structure of choice for a collection of vectors of different lengths.
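
As a small illustration of how `split()` behaves, the sketch below uses two made-up character vectors (`type` and `repo` are hypothetical stand-ins for the two columns). It shows that the order within each group is preserved and that the groups may differ in length:

```r
# Hypothetical event types and the repository each one belongs to.
type <- c("IssuesEvent", "WatchEvent", "PushEvent", "IssuesEvent")
repo <- c("bootstrap", "hogan.js", "bootstrap", "bootstrap")

# split() groups the first vector by the values of the second,
# keeping the original order of elements within each group.
by_repo <- split(type, repo)

by_repo[["bootstrap"]]  # "IssuesEvent" "PushEvent" "IssuesEvent"
by_repo[["hogan.js"]]   # "WatchEvent"
lengths(by_repo)        # bootstrap: 3, hogan.js: 1
```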

      Edit: Now that you have provided an example of your output data, it has become more apparent that your expected output is indeed a data.frame. You can convert the list above into a data.frame padded with NAs using the following function:

      list.to.df <- function(arg.list) {
         max.len  <- max(sapply(arg.list, length))
         arg.list <- lapply(arg.list, `length<-`, max.len)
         as.data.frame(arg.list)
      }
      
      df.out <- list.to.df(data.split)
      df.out
      #           bootstrap   hogan.js
      # 1       IssuesEvent WatchEvent
      # 2       IssuesEvent       <NA>
      # 3 IssueCommentEvent       <NA>
      # 4       IssuesEvent       <NA>
      # 5        WatchEvent       <NA>
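
The padding in `list.to.df` relies on calling the replacement function `length<-` directly. The following sketch, using a made-up character vector rather than the factors from the dataset, shows that trick in isolation:

```r
# A short character vector standing in for one repository's events.
x <- c("IssuesEvent", "WatchEvent")

# Calling `length<-` as an ordinary function extends the vector to the
# requested length and fills the new positions with NA.
padded <- `length<-`(x, 4)
padded  # "IssuesEvent" "WatchEvent" NA NA

# Once all vectors have the same length, they can form a data frame.
cols <- list(bootstrap = padded, hogan.js = c("WatchEvent", NA, NA, NA))
padded_df <- as.data.frame(cols, stringsAsFactors = FALSE)
dim(padded_df)  # 4 rows, 2 columns
```

Passing `length<-` to `lapply()` together with `max.len` applies the same padding to every element of the list.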
      

      You can then save that to a file using

      write.csv(df.out, file = "out.csv", quote = FALSE, na = "", row.names = FALSE)
      

      to get exactly the same output format as the one you published on GitHub.
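
To check what those `write.csv` arguments produce, here is a minimal round-trip sketch. The data frame `df` and the temporary file path are hypothetical. Reading the file back with `na.strings = ""` is an assumption about how the file would be consumed, not part of the original answer:

```r
# Hypothetical padded data frame with one missing cell.
df <- data.frame(
  bootstrap = c("IssuesEvent", "WatchEvent"),
  hogan.js  = c("WatchEvent", NA),
  stringsAsFactors = FALSE
)

# Write to a temporary file; na = "" leaves missing cells empty,
# quote = FALSE omits quotation marks, row.names = FALSE drops row numbers.
path <- tempfile(fileext = ".csv")
write.csv(df, file = path, quote = FALSE, na = "", row.names = FALSE)

readLines(path)
# "bootstrap,hogan.js" "IssuesEvent,WatchEvent" "WatchEvent,"

# Reading back with na.strings = "" turns the empty cells into NA again.
back <- read.csv(path, na.strings = "", stringsAsFactors = FALSE)
is.na(back$hogan.js)  # FALSE TRUE
```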
