透视大数据集 [英] Pivoting a large data set

查看:118
本文介绍了透视大数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个看起来有点像这样的csv(添加了一些标签以提高可读性):

I have a csv that looks a bit like this (tabs added for readability):

Dimension,    Date,    Metric
A,            Mon,     23
A,            Tues,    25
B,            Mon,     7
B,            Tues,    9

我想进行一些距离+峰群分析,这是我之前做过的.但是我喜欢(也许需要)这种格式:

I want to run some distance + hclust analysis, which I've done before. But I like (and perhaps need) it in this format:

Dimension,    Mon,    Tues
A,            23,     25
B,            7,      9

我可以使用Excel在枢轴上轻松完成此操作.问题是我有大约10,000个维度和大约1,200个日期-因此源CSV大约是1200万行乘3列.我想要〜10,000行乘以1,200列.

I could do this pretty easily in Excel with a pivot. The problem is I have ~10,000 dimensions and ~1,200 dates - so the source CSV is about 12M rows by 3 columns. I want ~10,000 rows by ~1,200 columns.

有没有一种方法可以在R中进行此转换?一个小的Python脚本执行此操作的逻辑很简单,但是我不确定它如何处理这么大的CSV-我无法想象这是一个新问题.不想重新发明轮子!

Is there a way I can do this transform in R? The logic of a little Python script to do this is simple, but I'm not sure how it'll handle such a large CSV - and I can't imagine this is a new issue. Don't want to reinvent the wheel!

感谢任何提示:)

推荐答案

或者只是spread:

library(tidyr)
spread(df, Date, Metric)
  Dimension Mon Tues
1         a  23   25
2         b   7    9

基准

 library(microbenchmark)
 microbenchmark(spread(df, Date, Metric))
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max neval
 spread(df, Date, Metric) 1.461595 1.491919 1.628366 1.566753 1.635374 2.606135   100
 microbenchmark(suppressMessages(dcast(dt, Dimension~Date)))
Unit: milliseconds
                                          expr      min       lq     mean   median       uq      max neval
 suppressMessages(dcast(dt, Dimension ~ Date)) 3.365726 3.416384 3.770659 3.471678 4.011316 7.235719   100

microbenchmark(suppressMessages(dcast.data.table(dt, Dimension~Date)))
Unit: milliseconds
                                                 expr      min      lq   

mean   median       uq
 suppressMessages(dcast.data.table(dt, Dimension ~ Date)) 2.375445 2.52218 2.7684 2.614706 2.703075
      max neval
 15.96149   100

,此处为不带sppressMessages

Unit: milliseconds
                                   expr      min       lq     mean median       uq     max neval
 dcast.data.table(dt, Dimension ~ Date) 2.667337 3.428127 4.749301 4.0476 5.289618 14.3823   100

这里的数据表不必猜测:

and here data table does not have to guess:

 microbenchmark(dcast.data.table(dt, Dimension ~ Date, value.var = "Metric"))
Unit: milliseconds
                                                         expr      min       lq    mean   median
 dcast.data.table(dt, Dimension ~ Date, value.var = "Metric") 2.077276 2.118707 2.28623 2.168667
       uq      max neval
 2.320579 5.780479   100

这篇关于透视大数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆