Efficient ways to reshape huge data from long to wide format - similar to dcast


Problem Description

This question pertains to creating "wide" tables, similar to those you could create using dcast from reshape2. I know this has been discussed many times before, but my question is about how to make the process more efficient. I have provided several examples below, which might make the question seem lengthy, but most of it is just test code for benchmarking.

Starting with a simple example,

> library(data.table)
> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30),
                  col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))

> z
     col1 col2  col3 col4
1:    1   10   5.0    a      # col1 = 1, col2 = 10
2:    1   10   2.0    a      # col1 = 1, col2 = 10
3:    2   20   2.3    b
4:    3   20   2.4    c
5:    4   30 100.0    a



We need to create a "wide" table that will have the values of the col4 column as column names and the value of the sum(col3) for each combination of col1 and col2.

> ulist = unique(z$col4) # These will be the additional column names

# Create the long table of sums per (col1, col2, col4) group
> z2 <- z[,list(sumcol=sum(col3)), by='col1,col2,col4']

# Pivot the long table: match() orders each group's sums by ulist,
# leaving NA where a group has no row for that col4 value
> z2 <- z2[,as.list(sumcol[match(ulist,col4)]), by=c("col1","col2")]

# Add column names (setnames modifies z2 by reference)
> setnames(z2, c("col1","col2",ulist))

> z2
   col1 col2   a   b   c
1:    1   10   7  NA  NA  # a = 5.0 + 2.0 = 7 corresponding to col1=1, col2=10
2:    2   20  NA 2.3  NA
3:    3   20  NA  NA 2.4
4:    4   30 100  NA  NA
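
The pivot works because, within each (col1, col2) group, match(ulist, col4) places that group's sums into a fixed column order, yielding NA for col4 values the group does not contain. A tiny illustration for the group col1=1, col2=10 (standalone values, just for demonstration):

sumcol <- 7                 # sum(col3) for col4 == "a" (5.0 + 2.0)
col4   <- "a"               # the only col4 value in this group
ulist  <- c("a","b","c")

sumcol[match(ulist, col4)]  # 7 NA NA -> fills columns a, b, c in order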

The issue I have is that while the above method is fine for smaller tables, it's virtually impossible to run on very large tables (unless you are fine with waiting x hours, maybe).

This, I believe, is likely related to the fact that the pivoted/wide table is much larger than the original table, since each row in the wide table has n columns corresponding to the unique values of the pivot column, whether or not there is any value for that cell (these are the NA values above). The size of the new table is therefore often 2x+ that of the original "long" table.
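
To make that concrete, here is a quick sketch (variable names are just for illustration) that counts, for the small z above, how many cells the wide table allocates versus how many actually hold data:

n_groups <- nrow(unique(z[, .(col1, col2)]))        # wide-table rows: 4
n_pivot  <- length(unique(z$col4))                  # pivot columns:   3
filled   <- nrow(unique(z[, .(col1, col2, col4)]))  # populated cells: 4

n_groups * n_pivot - filled  # 8 of 12 cells exist only to hold NA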

My original table has ~500 million rows and about 20 unique values in the pivot column. I have attempted to run the above using only 5 million rows and it takes forever in R (too long to wait for it to complete).

For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute on a production RDBMS running multithreaded. It completes in about 8 seconds using a single core with KDB+/Q (http://www.kx.com). It might not be a fair comparison, but it gives a sense that these operations can be done much faster by alternative means. KDB+ doesn't have sparse rows, so it allocates memory for all the cells and is still much faster than anything else I have tried.

What I need, however, is an R solution :) and so far I haven't found an efficient way to perform similar operations.

If you have experience with any alternative / more optimal solution, I'd be interested to hear about it. Sample code is provided below. You can vary the value of n to simulate the results. The unique values for the pivot column (column c3) have been fixed at 25.

library(data.table)

n = 100 # Increase this to benchmark

z <- data.table(c1=sample(1:10000,n,replace=T),
    c2=sample(1:100000,n,replace=T),
    c3=sample(1:25,n,replace=T),
    price=runif(n)*10)

c3.unique <- 1:25

# Aggregate, then pivot with the same match() trick as above
z <- z[,list(sumprice=sum(price)), by='c1,c2,c3'][,as.list(sumprice[match(c3.unique,c3)]), by='c1,c2']
setnames(z, c("c1","c2",c3.unique))
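
To see where the time goes as n grows, one option is to wrap just the reshape in system.time, a sketch that assumes z is still in its long (c1, c2, c3, price) form from the sampling step above:

# Re-run the sampling above first so z has not yet been reshaped
timing <- system.time({
    zw <- z[,list(sumprice=sum(price)), by='c1,c2,c3'][,as.list(sumprice[match(c3.unique,c3)]), by='c1,c2']
    setnames(zw, c("c1","c2",c3.unique))
})
print(timing)  # elapsed seconds for aggregation + pivot only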

Thanks,

Raj.

    Recommended Answer

    For n=1e6 the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:

    library(reshape2)

    # Aggregate first, then cast the summed column (V1) to wide format
    dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)

    # or, with data.table 1.8.11, aggregate and cast in one step
    # (fun partially matches the fun.aggregate argument)
    dcast.data.table(z, c1 + c2 ~ c3, fun = sum)

