How to expand a large dataframe in R


Problem description


I have a dataframe

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4), 
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01", 
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19", 
           "1990-06-19", "2000-05-12"), 
  spp = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y = rpois(10, 5))

   id       date spp y
1   1 1985-06-19   a 6
2   1 1985-06-19   b 3
3   1 1985-06-19   c 7
4   2 1985-08-01   c 7
5   2 1985-08-01   d 6
6   3 1990-06-19   b 5
7   3 1990-06-19   c 4
8   3 1990-06-19   d 4
9   3 1990-06-19   a 6
10  4 2000-05-12   b 6

I want to expand it so that there is a row for every combination of id and spp, with y = 0 for every combination that is not currently in the dataframe. The dataframe currently has about 100,000 rows and 15 columns. When expanded it would have about 300,000 rows (there are 17 unique values of spp in my actual dataset).

For every value of id the date is the same (e.g. when id = 2, date always = 1985-08-01). In my real dataset all the columns except spp and y can be specified by the id.

I want to end up with something like:

   id       date spp y
   1 1985-06-19   a 6
   1 1985-06-19   b 3
   1 1985-06-19   c 7
   1 1985-06-19   d 0*
   2 1985-08-01   a 0*
   2 1985-08-01   b 0*
   2 1985-08-01   c 7
   2 1985-08-01   d 6
   3 1990-06-19   b 5
   3 1990-06-19   c 4
   3 1990-06-19   d 4
   3 1990-06-19   a 6
   4 2000-05-12   a 0*
   4 2000-05-12   b 6
   4 2000-05-12   c 0*
   4 2000-05-12   d 0*

  * indicates added rows

I will likely have to do this in the future with potentially much larger data frames, so a quick, efficient (in time and memory) way to do this would be appreciated, but any solution would satisfy me. I figure there should be ways to use the dplyr, data.table, or reshape packages, but I'm not very familiar with any of them. I'm not sure whether it would be easiest to expand just the columns id, spp, and y, then do a left_join() or merge() on id to bring back date (and all the other variables in my real dataframe)?
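
The expand-then-join idea described above could be sketched roughly as follows with dplyr and tidyr. This is only an illustration under some assumptions: the y values are fixed instead of drawn with rpois() so the result is reproducible, and the names id_info and expanded are made up for the example.

```r
library(dplyr)
library(tidyr)

# Toy data in the shape of the question (y fixed for reproducibility).
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01",
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19",
           "1990-06-19", "2000-05-12"),
  spp  = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y    = c(6, 3, 7, 7, 6, 5, 4, 4, 6, 6)
)

# One row per id holding every column that id determines (here only date).
id_info <- distinct(df, id, date)

expanded <- expand_grid(id = unique(df$id), spp = sort(unique(df$spp))) %>%
  # Attach the observed y values; missing combinations get NA.
  left_join(select(df, id, spp, y), by = c("id", "spp")) %>%
  # Combinations that were not observed count as zero.
  mutate(y = replace(y, is.na(y), 0)) %>%
  # Bring back the id-level columns.
  left_join(id_info, by = "id") %>%
  select(id, date, spp, y)
```

Keeping the id-level columns in a separate one-row-per-id table means the 15 or so extra columns are joined once at the end instead of being carried through the expansion. tidyr's complete(df, nesting(id, date), spp, fill = list(y = 0)) is a shorter route that should give an equivalent result.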

Solution

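expand.grid is a useful function here:

# Build every id/spp combination, then merge the observed rows onto it.
mergedData <- merge(
    expand.grid(id = unique(df$id), spp = unique(df$spp)),
    df, by = c("id", "spp"), all = TRUE)

# Combinations that were absent from df get y = 0.
mergedData$y[is.na(mergedData$y)] <- 0

# Look up date by id, so this works whether or not date is a factor.
idDates <- unique(df[, c("id", "date")])
mergedData$date <- idDates$date[match(mergedData$id, idDates$id)]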

Since you're not actually doing anything to subsets of the data, I don't think plyr will help; there may be more efficient ways to do this with data.table.
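
As a rough idea of what a data.table version might look like, here is a minimal sketch. It is not part of the original answer and has not been benchmarked: the y values are fixed for reproducibility, and the names dt, grid, id_info, and expanded are illustrative.

```r
library(data.table)

# Toy data in the shape of the question (y fixed for reproducibility).
dt <- data.table(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01",
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19",
           "1990-06-19", "2000-05-12"),
  spp  = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y    = c(6, 3, 7, 7, 6, 5, 4, 4, 6, 6)
)

# CJ() builds the cross join: one row per id/spp combination.
grid <- CJ(id = unique(dt$id), spp = unique(dt$spp))

# Join dt onto the grid; unmatched combinations get NA in date and y.
expanded <- dt[grid, on = .(id, spp)]

# Unobserved combinations count as zero (updated by reference).
expanded[is.na(y), y := 0]

# Fill date from a one-row-per-id lookup with an update join.
id_info <- unique(dt[, .(id, date)])
expanded[id_info, on = "id", date := i.date]
```

The two := steps modify expanded in place instead of copying it, which may help with memory on larger tables. With more id-level columns, the last step would assign each of them from id_info in the same way.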
