How to efficiently merge two datasets?


Problem description

I am trying to merge two fairly large - but not ridiculously so (360,000 X 4, 57,000 X 4) - datasets by one common ID. I have tried a regular merge(), merge.data.table(), and sqldf(). Every time I keep running out of memory (cannot allocate vector of size...). Is there any solution to this? Or is R a bad tool for merging data? head() is given below (I am trying to merge on STUDENT.NAME):

  ID10    STUDENT.NAME   FATHER.NAME MOTHER.NAME
1    1     DEEKSHITH J       JAYANNA      SWARNA
2    4    MANIKANTHA D       DEVARAJ     MANJULA
3    5        NAGESH T   THIMMAIAH N    SHIVAMMA
4    6    NIZAMUDDIN R NOOR MOHAMMED        BIBI
5    7 PRABHU YELLAPPA      YELLAPPA    MALLAMMA
6    8    SADDAM PASHA   NISAR AHMED     ZAREENA
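
For reference, the calls described above would look roughly like this (a sketch only, not from the original post; DF1 and DF2 stand for the two datasets, as in the answer below):

merge(DF1,DF2,by="STUDENT.NAME")           # base R merge

library(data.table)                        # keyed data.table merge
DT1 <- data.table(DF1,key="STUDENT.NAME")
DT2 <- data.table(DF2,key="STUDENT.NAME")
merge(DT1,DT2,by="STUDENT.NAME")           # newer data.table versions may need allow.cartesian=TRUE

library(sqldf)                             # SQL-style join
# sqldf may rewrite '.' in column names as '_', hence STUDENT_NAME
sqldf("SELECT * FROM DF1 JOIN DF2 USING (STUDENT_NAME)")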

Solution

From the nature of your problem, you must be doing a many-to-many merge, where each student occurs many times in each dataframe. You might want to check how many times. If each student occurs twice in each dataframe, one student produces 4 rows; if a student occurs 10 times, the merge adds 100 rows. So first check how many rows you'll get. This is the function I use for that:

count.rows <- function(x,y,v,all=FALSE){
    tx <- table(x[[v]])                                     # occurrences of each key in x
    ty <- table(y[[v]])                                     # occurrences of each key in y
    val <- names(tx)[match(names(tx),names(ty),0L) > 0L]    # keys present in both
    cts <- rbind(tx[match(val,names(tx))],ty[match(val,names(ty))])
    colnames(cts) <- val
    sum(apply(cts,2,prod,na.rm=all),na.rm=TRUE)             # sum of per-key count products
}
count.rows(DF1,DF2,"STUDENT.NAME")

If you read the R documentation for merge(), you'd see that the complexity depends on the length of the result. This is not due to the merge algorithm itself, but to binding all the results together. If you really want a less memory-hungry solution, you especially need to get rid of that binding. The following algorithm does that for you. I wrote it out so you can follow the logic, and it can be optimized. Mind you, it does not give exactly the same result: it copies all columns of both dataframes, so you might want to adapt that a little.

mymerge <- function(x,y,v,count.only=FALSE){
    ix <- match(v,names(x))                     # position of the merge column in x
    iy <- match(v,names(y))                     # position of the merge column in y

    xx <- x[,ix]                                # key vectors
    yy <- y[,iy]
    ox <- order(xx)                             # sort orders of the keys
    oy <- order(yy)
    xx <- xx[ox]                                # sorted keys
    yy <- yy[oy]

    nx <- length(xx)
    ny <- length(yy)

    val <- unique(xx)
    val <- val[match(val,yy,0L) > 0L]           # keys present in both dataframes
    cts <- cbind(table(xx)[val],table(yy)[val]) # per-key counts in x and y
    dimr <- sum(apply(cts,1,prod),na.rm=TRUE)   # total number of result rows

    idx <- vector("numeric",dimr)               # preallocated row indices into sorted x
    idy <- vector("numeric",dimr)               # preallocated row indices into sorted y
    # start of each key's block in the sorted keys, paired with the start of the next block
    ndx <- embed(c(which(!duplicated(xx)),nx+1),2)[unique(xx) %in% val,]
    ndy <- embed(c(which(!duplicated(yy)),ny+1),2)[unique(yy) %in% val,]

    count = 1
    for(i in 1:nrow(ndx)){
        nx <- abs(diff(ndx[i,]))                # block size for this key in x
        ny <- abs(diff(ndy[i,]))                # block size for this key in y
        ll <- nx*ny                             # rows this key contributes

        idx[count:(count+ll-1)] <-
          rep(ndx[i,2]:(ndx[i,1]-1),ny)         # the x block, cycled once per y row

        idy[count:(count+ll-1)] <-
          rep(ndy[i,2]:(ndy[i,1]-1),each=nx)    # each y row repeated once per x row
        count <- count+ll
    }
    x <- x[ox[idx],]                            # select the matching rows of x
    names(y) <- paste("y.",names(y),sep="")     # prefix y's columns to avoid name clashes
    x[names(y)] <- y[oy[idy],]                  # attach the matching rows of y
    rownames(x) <- 1:nrow(x)
    x
}

Some testing code so you can see it works:

DF1 <- data.frame(
    ID = 1:10,
    STUDENT.NAME=letters[1:10],
    SCORE = 1:10
)
id <- c(3,11,4,6,6,12,1,4,7,10,5,3)
DF2 <- data.frame(
    ID = id,
    STUDENT.NAME=letters[id],
    SCORE = 1:12
)

mymerge(DF1,DF2,"STUDENT.NAME")
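
As a quick sanity check (not part of the original answer), the row count predicted by count.rows() and the one produced by a plain merge() should agree with what mymerge() returns; for the toy data above, with character STUDENT.NAME columns (the data.frame() default in R >= 4.0), all three give 10:

count.rows(DF1,DF2,"STUDENT.NAME")         # predicted number of result rows
nrow(merge(DF1,DF2,by="STUDENT.NAME"))     # rows from base merge()
nrow(mymerge(DF1,DF2,"STUDENT.NAME"))      # rows from mymerge()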

Doing the same with two dataframes of 0.5 million rows and 4 columns, with up to 10 matches per student name, returns a dataframe with 5.8 million rows and 8 columns and gives the following picture of the memory use:

In that profile, the yellow box is the merge call and the green box is the mymerge call. Memory ranges from 2.3 Gb to 3.74 Gb, so the merge call uses about 1.45 Gb and mymerge a bit over 0.8 Gb. Still no "out of memory" errors... The testing code for this is below:

Names <- sapply(
      replicate(120000,sample(letters,4,TRUE),simplify=FALSE),
      paste,collapse="")

DF1 <- data.frame(
    ID10 = 1:500000,
    STUDENT.NAME = sample(Names[1:50000],500000,TRUE),
    FATHER.NAME = sample(letters,500000,TRUE),
    SCORE1 = rnorm(500000),
    stringsAsFactors=FALSE
)

id <- sample(500000,replace=TRUE)
DF2 <- data.frame(
    ID20 = DF1$ID10,
    STUDENT.NAME = DF1$STUDENT.NAME[id],
    SCORE = rnorm(500000),
    SCORE2= rnorm(500000),
    stringsAsFactors=FALSE
)
id2 <- sample(500000,20000)
DF2$STUDENT.NAME[id2] <- sample(Names[100001:120000],20000,TRUE)

gc()
system.time(X <- merge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
Sys.sleep(1)
rm(X)
gc()
Sys.sleep(3)
system.time(X <- mymerge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
rm(X)
gc()
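
If you want numbers rather than a picture, one option (not part of the original answer) is to reset and then read R's own peak-memory counters around each call:

gc(reset=TRUE)                          # reset the "max used" statistics
X <- mymerge(DF1,DF2,"STUDENT.NAME")
gc()[,"max used"]                       # peak Ncells/Vcells used since the reset
rm(X)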
