Error while merging data frames using the "data.table" package


Problem Description

The following is a reproducible example of a situation that I'm experiencing and stuck with (it's a test client I'm using to evaluate various approaches to merging datasets for my dissertation research).

testData <- "https://github.com/abnova/test/blob/master/mergeTestData.zip?raw=true"

tmpFile <- tempfile()
tmpDir <- tempdir()

download.file(testData, tmpFile, method = 'curl',
              extra = '-L', quiet = TRUE)
testFiles <- unzip(tmpFile, exdir = tmpDir)

# To enable desired merge option, uncomment corresponding line

#MERGE_OPTION <- "lapply_merge"
#MERGE_OPTION <- "lapply_merge2"
#MERGE_OPTION <- "reduce_merge"
#MERGE_OPTION <- "reduce_merge2"
#MERGE_OPTION <- "reshape"
#MERGE_OPTION <- "plyr"
#MERGE_OPTION <- "dplyr"
MERGE_OPTION <- "data.table"
#MERGE_OPTION <- "data.table2"

loadData <- function (dataFile) {

  if (file.exists(dataFile)) {
    data <- readRDS(dataFile)
  }
  else { # error() undefined - replaced with stop() for now
    stop("Data file \'", dataFile, "\' not found! Run 'make' first.")
  }
  return (data)
}

loadDataSets <- function (dataDir) {

  dataSets <- list()

  dataFiles <- dir(dataDir, pattern='\\.rds$')
  dataSets <- lapply(seq_along(dataFiles),
                     function(i) {
                       nameSplit <- strsplit(dataFiles[i], "\\.")
                       dataset <- nameSplit[[1]][1]
                       assign(dataset,
                              loadData(file.path(dataDir, dataFiles[i])))
                       return (get(dataset))
                     })
  return (dataSets)
}

# load the datasets of transformed data
dataSets <- loadDataSets(tmpDir)

if (MERGE_OPTION == "lapply_merge") { # Option 1

  flossData <- data.frame(dataSets[[1]][1])

  # merge all loaded datasets by common column ("Project ID")
  silent <- lapply(seq(2, length(dataSets)),
                   function(i) {merge(flossData, dataSets[[1]][i],
                                      by = "Project ID",
                                      all = TRUE)})
}

if (MERGE_OPTION == "lapply_merge2") { # Option 1

  pids <- which(sapply(dataSets,
                       FUN=function(x) {'Project ID' %in% names(x)}))

  flossData <- dataSets[[pids[1]]]

  for (id in pids[2:length(pids)]) {
    flossData <- merge(flossData, dataSets[[id]],
                       by='Project ID', all = TRUE)
  }
}

if (MERGE_OPTION == "reduce_merge") { # Option 2

  flossData <- Reduce(function(...) 
    merge(..., by.x = "row.names", by.y = "Project ID", all = TRUE),
    dataSets)
}

# http://r.789695.n4.nabble.com/merge-multiple-data-frames-tt4331089.html#a4333772
if (MERGE_OPTION == "reduce_merge2") { # Option 2

    mergeAll <- function(..., by = "Project ID", all = TRUE) {
    dotArgs <- list(...)
    dotNames <- lapply(dotArgs, names)
    repNames <- Reduce(intersect, dotNames)
    repNames <- repNames[repNames != by]
    for(i in seq_along(dotArgs)){
      wn <- which( (names(dotArgs[[i]]) %in% repNames) &
                     (names(dotArgs[[i]]) != by))
      names(dotArgs[[i]])[wn] <- paste(names(dotArgs[[i]])[wn],
                                       names(dotArgs)[[i]], sep = ".")
    }
    Reduce(function(x, y) merge(x, y, by = by, all = all), dotArgs)
  }

  flossData <- mergeAll(dataSets)
}

if (MERGE_OPTION == "reshape") { # Option 3

  if (!suppressMessages(require(reshape))) install.packages('reshape')
  library(reshape)
  flossData <- reshape::merge_all(dataSets)
}

if (MERGE_OPTION == "plyr") { # Option 4

  if (!suppressMessages(require(plyr))) install.packages('plyr')
  library(plyr)
  flossData <- plyr::join_all(dataSets)
}

if (MERGE_OPTION == "dplyr") { # Option 5

  if (!suppressMessages(require(dplyr))) install.packages('dplyr')
  library(dplyr)

  flossData <- dataSets[[1]][1]
  flossData <- lapply(dataSets[[1]][-1],
                      function(x) {dplyr::left_join(x, flossData)})
}

if (MERGE_OPTION == "data.table") { # Option 6

  if (!suppressMessages(require(data.table))) 
    install.packages('data.table')
  library(data.table)

  flossData <- data.table(dataSets[[1]], key="Project ID")

  for (id in 2:length(dataSets)) {
    flossData <- merge(flossData, data.table(dataSets[[id]]),
                       by='Project ID', all.x = TRUE, all.y = FALSE)
  }
}

# http://stackoverflow.com/a/17458887/2872891
if (MERGE_OPTION == "data.table2") { # Option 6

  if (!suppressMessages(require(data.table))) 
    install.packages('data.table')
  library(data.table)

  DT <- data.table(dataSets[[1]], key="Project ID")
  flossData <- lapply(dataSets[[1]][-1], function(x) DT[.(x)])
}

# Additional Transformations (see TODO above)

# convert presence of Repo URL to integer
flossData[["Repo URL"]] <- as.integer(flossData[["Repo URL"]] != "")

# convert License Restrictiveness' factor levels to integers
#flossData[["License Restrictiveness"]] <- 
#  as.integer(flossData[["License Restrictiveness"]])

# convert User Community Size from character to integer
flossData[["User Community Size"]] <- 
  as.integer(flossData[["User Community Size"]])

# remove NAs
#flossData <- flossData[complete.cases(flossData[,3]),]
rowsNA <- apply(flossData, 1, function(x) {any(is.na(x))})
flossData <- flossData[!rowsNA,]

print(str(flossData))

The error message is as follows:

Starting bmerge ...done in 0.001 secs
Starting bmerge ...done in 0.002 secs
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 

Join results in 121229 rows; more than 100000 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

The current problem is with the enabled data.table option, but, since it's the same package, I would also appreciate advice on the next option, which uses an alternative data.table syntax for merging (I find it rather confusing, but include it for the sake of completeness). Thank you in advance!

Solution

I'd approach the issue in this manner:

First, there's an error message. What does it say?

Join results in 121229 rows; more than 100000 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

Great! But I'm working with so many datasets, so many packages and so many functions. I've got to narrow this down to which data set produces this error.

Testing one by one:

ans1 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[2]]), 
                all.x=TRUE, all.y=FALSE, by="Project ID")
## works fine.

ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]), 
                all.x=TRUE, all.y=FALSE, by="Project ID")
## same error

Aha, got the same error.

Reading the second line of the error message:

So, something seems to be happening with dataSets[[3]]. It says to check for duplicate key values in i. Let's do that:

dim(dataSets[[3]])
# [1] 81487     3
dim(unique(as.data.table(dataSets[[3]]), by="Project ID"))
# [1] 49999     3

So, dataSets[[3]] has duplicated 'Project ID' values, and so for each duplicated value, all the matching rows from dataSets[[1]] are returned - which is what the 2nd part of the 2nd line explains: each of which join to the same group in x over and over again.
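
To make the mechanics concrete, here is a minimal, self-contained illustration (toy tables invented for this note, not the question's data) of how duplicate keys in i multiply the rows of the join:

library(data.table)
x <- data.table(id = 1, xv = c("a", "b"), key = "id")  # two rows share key 1
i <- data.table(id = 1, iv = c("p", "q"), key = "id")  # two more rows with key 1
# every row of i matches every row of x with the same key: 2 * 2 = 4 rows
x[i, allow.cartesian = TRUE]

Four result rows out of two two-row tables; scale that up to thousands of duplicates and you get the 121229-row blow-up the error is guarding against.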

Trying out allow.cartesian=TRUE:

I know that there are duplicate keys and still wish to proceed. But the error message mentions how we can proceed: add "allow.cartesian=TRUE".

ans2 = merge(as.data.table(dataSets[[1]]), as.data.table(dataSets[[3]]), 
                all.x=TRUE, all.y=FALSE, by="Project ID", allow.cartesian=TRUE)

Aha, now it works fine! So what does allow.cartesian = TRUE do? Or why was it added? The error message says to search for the message on Stack Overflow (among other things).

Searching for allow.cartesian=TRUE on SO:

And the search lands me on this Why is allow.cartesian required at times when joining data.tables with duplicate keys? question, which explains the purpose, and which also contains, in the comments, another link from @Roland: Merging data.tables uses more than 10 GB RAM, which points to the initial issue that started it all. Let me read those posts now.


Is base::merge giving a different result?

Now, does base::merge return a different result (with 100,000 rows)?

dim(merge(dataSets[[1]], dataSets[[3]], all.x=TRUE, all.y=FALSE, by="Project ID"))
# [1] 121229      4

Not really. It gives the same dimensions as when using data.table, but it just doesn't care if there are duplicate keys, whereas data.table warns you of a potential explosion of the merged results and allows you to make an informed decision.
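
A closing note beyond the original answer: if the duplicated 'Project ID' rows in dataSets[[3]] are not meaningful for the analysis - an assumption about the intended semantics, not something the question settles - one option is to deduplicate before merging:

# assumption: keeping a single row per 'Project ID' is acceptable here
ds3 <- unique(as.data.table(dataSets[[3]]), by = "Project ID")
ans3 <- merge(as.data.table(dataSets[[1]]), ds3,
              all.x = TRUE, all.y = FALSE, by = "Project ID")

And since the question also asks about the alternative data.table join syntax (the "data.table2" option), a minimal sketch with hypothetical keyed tables: for keyed tables, X[Y] keeps every row of Y, so Y[X] corresponds to merge(X, Y, all.x = TRUE, all.y = FALSE).

X <- data.table(`Project ID` = c(1, 2), a = c(10, 20), key = "Project ID")
Y <- data.table(`Project ID` = c(2, 3), b = c(30, 40), key = "Project ID")
Y[X]                                      # all rows of X, NA where Y has no match
merge(X, Y, all.x = TRUE, all.y = FALSE)  # same rows, different column order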
