Efficient ways to reshape huge data from long to wide format - similar to dcast

Problem Description

This question pertains to creating "wide" tables similar to tables you could create using dcast from reshape2. I know this has been discussed many times before, but my question pertains to how to make the process more efficient. I have provided several examples below which might make the question seem lengthy, but most of it is just test code for benchmarking.

Starting with a simple example,

> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30), 
                  col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))

> z
     col1 col2  col3 col4
1:    1   10   5.0    a      # col1 = 1, col2 = 10
2:    1   10   2.0    a      # col1 = 1, col2 = 10
3:    2   20   2.3    b
4:    3   20   2.4    c
5:    4   30 100.0    a

We need to create a "wide" table that will have the values of the col4 column as column names and sum(col3) as the value for each combination of col1 and col2.

> ulist = unique(z$col4) # These will be the additional column names

# Create long table with sum
> z2 <- z[,list(sumcol=sum(col3)), by='col1,col2,col4']

# Pivot the long table
> z2 <- z2[,as.list((sumcol[match(ulist,col4)])), by=c("col1","col2")]

# Add column names
> setnames(z2, c("col1","col2",ulist))

> z2
   col1 col2   a   b   c
1:    1   10   7  NA  NA  # a = 5.0 + 2.0 = 7 corresponding to col1=1, col2=10
2:    2   20  NA 2.3  NA
3:    3   20  NA  NA 2.4
4:    4   30 100  NA  NA
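For reference, the same wide table can be built in one step with data.table's own dcast - a sketch assuming data.table >= 1.9.6, where dcast gained a native data.table method (fill = NA keeps empty cells as NA, matching the output above):

```r
library(data.table)

z <- data.table(col1 = c(1, 1, 2, 3, 4),
                col2 = c(10, 10, 20, 20, 30),
                col3 = c(5, 2, 2.3, 2.4, 100),
                col4 = c("a", "a", "b", "c", "a"))

# aggregate and pivot in one call: sum col3 per (col1, col2),
# spreading the unique col4 values into columns
w <- dcast(z, col1 + col2 ~ col4,
           value.var = "col3", fun.aggregate = sum, fill = NA)
```

This avoids the separate match()-based pivot and setnames steps entirely.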

The issue I have is that while the above method is fine for smaller tables, it's virtually impossible to run it on very large tables (unless you are fine with waiting for x hours).

This, I believe, is likely related to the fact that the pivoted / wide table is of a much larger size than the original table, since each row in the wide table has n columns corresponding to the unique values of the pivot column, whether or not there is any value corresponding to that cell (these are the NA values above). The size of the new table is therefore often 2x+ that of the original "long" table.
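The blow-up can be estimated directly: the wide table materializes one cell per (group, pivot value) pair regardless of sparsity, so the cell count is distinct groups times unique pivot values. A quick sketch (with hypothetical sample data in the shape of the benchmark below):

```r
library(data.table)

n <- 1e5
z <- data.table(c1 = sample(1:10000, n, replace = TRUE),
                c2 = sample(1:100000, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE))

groups <- nrow(unique(z[, list(c1, c2)]))      # rows in the wide table
cells  <- groups * 25                          # every group gets all 25 columns
filled <- nrow(unique(z[, list(c1, c2, c3)]))  # cells that actually hold data

c(cells = cells, filled = filled)              # the gap is pure NA padding
```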

My original table has ~500 million rows and about 20 unique values. I have attempted to run the above using only 5 million rows and it takes forever in R (too long to wait for it to complete).

For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute on production RDBMS systems running multithreaded, and in about 8 "seconds" on a single core using KDB+/Q (http://www.kx.com). It might not be a fair comparison, but it gives a sense that these operations can be done much faster by alternative means. KDB+ doesn't have sparse rows, so it allocates memory for all the cells, and it is still much faster than anything else I have tried.

What I need however, is an R solution :) and so far, I haven't found an efficient way to perform similar operations.

If you have had experience with this and could suggest any alternative / more optimal solution, I'd be interested to hear about it. Sample code is provided below. You can vary the value of n to simulate the results. The number of unique values for the pivot column (column c3) is fixed at 25.

n = 100 # Increase this to benchmark

z <- data.table(c1=sample(1:10000,n,replace=TRUE),
    c2=sample(1:100000,n,replace=TRUE),
    c3=sample(1:25,n,replace=TRUE),
    price=runif(n)*10)

c3.unique <- 1:25

z <- z[,list(sumprice=sum(price)), by='c1,c2,c3'][,as.list(sumprice[match(c3.unique,c3)]), by='c1,c2']
setnames(z, c("c1","c2",c3.unique))
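To compare approaches directly, the generation code above can be wrapped in a small timing harness - a sketch that regenerates z and times the match()-based pivot against dcast on the pre-aggregated long table (increase n toward 5e6 to stress-test):

```r
library(data.table)
library(reshape2)

n <- 1e5  # increase toward 5e6 to reproduce the timings discussed
set.seed(1)

z <- data.table(c1 = sample(1:10000, n, replace = TRUE),
                c2 = sample(1:100000, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE),
                price = runif(n) * 10)

# the match()-based pivot from the question
t.match <- system.time(
  w1 <- z[, list(sumprice = sum(price)), by = 'c1,c2,c3'][
          , as.list(sumprice[match(1:25, c3)]), by = 'c1,c2']
)

# dcast on the pre-aggregated long table
t.dcast <- system.time(
  w2 <- dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)
)

print(rbind(match.pivot = t.match, dcast = t.dcast))
```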

Thanks,

  • Raj.

Recommended Answer

For n=1e6 the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:

library(reshape2)

dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)

# or with 1.8.11
dcast.data.table(z, c1 + c2 ~ c3, fun = sum)
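A note beyond the original answer: in current data.table (>= 1.9.6), dcast dispatches on data.tables directly, so the 1.8.11 dcast.data.table call can be written as a sketch like this (shown with a small self-contained z in the question's shape):

```r
library(data.table)

# minimal data in the question's shape
n <- 1000
z <- data.table(c1 = sample(1:10, n, replace = TRUE),
                c2 = sample(1:10, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE),
                price = runif(n) * 10)

# aggregation and pivot in one call; fun.aggregate replaces the older fun argument
w <- dcast(z, c1 + c2 ~ c3, value.var = "price", fun.aggregate = sum)
```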
