How to expand a large dataframe in R


Problem description

I have a data frame:

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4), 
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01", 
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19", 
           "1990-06-19", "2000-05-12"), 
  spp = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y = rpois(10, 5))

   id       date spp y
1   1 1985-06-19   a 6
2   1 1985-06-19   b 3
3   1 1985-06-19   c 7
4   2 1985-08-01   c 7
5   2 1985-08-01   d 6
6   3 1990-06-19   b 5
7   3 1990-06-19   c 4
8   3 1990-06-19   d 4
9   3 1990-06-19   a 6
10  4 2000-05-12   b 6

I want to expand it so that every combination of id and spp is present, with y = 0 for every combination that is not currently in the dataframe. The dataframe currently has about 100,000 rows and 15 columns. When expanded it would have about 300,000 rows (there are 17 unique values of spp in my actual dataset).

For every value of id the date is the same (e.g. when id = 2, date is always 1985-08-01). In my real dataset, all the columns except spp and y are determined by id.

I would like to end up with the following:

   id       date spp y
   1 1985-06-19   a 6
   1 1985-06-19   b 3
   1 1985-06-19   c 7
   1 1985-06-19   d 0*
   2 1985-08-01   a 0*
   2 1985-08-01   b 0*
   2 1985-08-01   c 7
   2 1985-08-01   d 6
   3 1990-06-19   b 5
   3 1990-06-19   c 4
   3 1990-06-19   d 4
   3 1990-06-19   a 6
   4 2000-05-12   a 0*
   4 2000-05-12   b 6
   4 2000-05-12   c 0*
   4 2000-05-12   d 0*




* indicates added rows

I will likely have to do this in the future with potentially much larger data frames, so a quick, efficient (in time and memory) way to do this would be appreciated, but any solution would satisfy me. I figure there should be ways to use the dplyr, data.table, or reshape packages, but I'm not very familiar with any of them. I'm not sure whether it would be easiest to expand just the id, spp, and y columns, then do a left_join() or merge() to recombine date (and all the other variables in my real dataframe) based on id.

Recommended answer

expand.grid is a useful function here:

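    # Build every id x spp combination, then left-join the observed rows onto it
    mergedData <- merge(
        expand.grid(id = unique(df$id), spp = unique(df$spp),
                    stringsAsFactors = FALSE),
        df, by = c("id", "spp"), all.x = TRUE)
    
    # Combinations that were absent from df get y = 0
    mergedData$y[is.na(mergedData$y)] <- 0
    
    # date is determined by id, so look it up from the first matching row of df
    # (this does not rely on date or spp being factors)
    mergedData$date <- df$date[match(mergedData$id, df$id)]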
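
The question notes that in the real dataset every column other than spp and y is determined by id. A minimal base R sketch of the "expand, then merge back" idea for that case might look like the following. The names id_cols, id_info, grid, and expanded are illustrative and not from the original post, and y is fixed to the values printed in the question rather than drawn with rpois() so that the result is reproducible.

```r
# Example data from the question, with y fixed to the printed values
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01",
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19",
           "1990-06-19", "2000-05-12"),
  spp = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y = c(6, 3, 7, 7, 6, 5, 4, 4, 6, 6),
  stringsAsFactors = FALSE)

# Assumption: every column except spp and y is constant within an id
id_cols <- setdiff(names(df), c("spp", "y"))
id_info <- unique(df[id_cols])   # one row per id

# Every id x spp combination
grid <- expand.grid(id = unique(df$id), spp = unique(df$spp),
                    stringsAsFactors = FALSE)

# Left join the observed y values onto the grid; missing combinations become NA
expanded <- merge(grid, df[c("id", "spp", "y")],
                  by = c("id", "spp"), all.x = TRUE)
expanded$y[is.na(expanded$y)] <- 0

# Re-attach the id-level columns, then restore the original column order
expanded <- merge(id_info, expanded, by = "id")
expanded <- expanded[order(expanded$id, expanded$spp), names(df)]
```

Merging only the narrow id/spp/y table against the grid keeps the wide id-level columns out of the larger join, which may help with memory on a 15-column dataframe, though that has not been measured here.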
    

Since you're not actually doing anything to subsets of the data, I don't think plyr will help; there may be more efficient ways to do this with data.table.
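
The answer does not show a data.table version, so the following is only a sketch of one possible approach, assuming the data.table package is installed. It uses CJ() to build the cross join of id and spp and a keyed-style join with on = to pull in the observed values. The names dt, id_info, grid, and expanded are illustrative, and y is again fixed to the values printed in the question.

```r
library(data.table)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01",
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19",
           "1990-06-19", "2000-05-12"),
  spp = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y = c(6, 3, 7, 7, 6, 5, 4, 4, 6, 6),
  stringsAsFactors = FALSE)

dt <- as.data.table(df)

# One row per id holding the id-level columns (here only date)
id_info <- unique(dt[, .(id, date)])

# Every id x spp combination
grid <- CJ(id = unique(dt$id), spp = unique(dt$spp))

# Keep every row of grid; y is NA where the combination was not observed
expanded <- dt[, .(id, spp, y)][grid, on = .(id, spp)]

# Fill the missing counts by reference, without copying the table
expanded[is.na(y), y := 0]

# Re-attach the id-level columns
expanded <- id_info[expanded, on = "id"]
```

Whether this is actually faster than the merge() version on a 100,000-row dataframe would need to be benchmarked on the real data.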
