dcast fails to cast character column when the data size is large

Problem description

I'm using the dcast function from the reshape2 package to cast a simple table of three columns:

df = data.table(id  = 1:1e6,
                var = c('continent', 'subcontinent', ...),
                val = c('America', 'Caribbean', ...))

by dcast(df, id ~ var, value.var = 'val'), and it automatically converts the values to counts, i.e.

id     continent   subcontinent
 1     1           1
 2     1           1

However, if I reduce the size to 10000 rows, it correctly outputs

id     continent   subcontinent
 1     America     Caribbean
 2     Europe      West Europe

Is this a bug, or do I need to change the code somehow? Please help. Thanks!

Solution

The problem is not the size of the dataset itself but the occurrence of duplicate entries in the full dataset. By picking smaller subsets from the full dataset there is a chance that no duplicates are included.

help("dcast", "data.table") says:

When variable combinations in formula doesn't identify a unique value in a cell, fun.aggregate will have to be specified, which defaults to length if unspecified.
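One duplicated (id, var) combination is enough to trigger that default. A minimal sketch using the 5-row sample data from the Data section at the end (reproduced here for convenience):

library(data.table)
df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)

# row 5 duplicates the (id = 1, var = "subcontinent") cell
dcast(df, id ~ var, value.var = "val")

Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1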

How to find duplicates in the full dataset

All occurrences of duplicates can be identified by

cols <- c("id", "var")
df[duplicated(df, by = cols) | duplicated(df, by = cols, fromLast = TRUE)][
  order(id)]

   id          var           val
1:  1 subcontinent     Caribbean
2:  1 subcontinent South America

Note that we are looking for duplicates in id and var as these two form the cells, i.e., rows and columns, of the reshaped result.
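An alternative way to see which cells hold more than one value is to count the rows per (id, var) combination (a small sketch, again using the sample data):

df[, .N, by = .(id, var)][N > 1L]

   id          var N
1:  1 subcontinent 2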

Why unique() doesn't work

NB: This explains why simply taking unique(df) will not work:

unique(df)

   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America

does not remove any rows. Consequently,

dcast(unique(df), id ~ var, value.var = "val")

Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1

Whereas

unique(df, by = cols)

   id          var         val
1:  1    continent     America
2:  1 subcontinent   Caribbean
3:  2    continent      Europe
4:  2 subcontinent West Europe

has removed the duplicate var for id == 1L. Consequently,

dcast(unique(df, by = cols), id ~ var, value.var = "val")

   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe

How to find the row numbers of duplicated rows

The OP has reported that the issue appears only with the full dataset but not with a subset of the first 1e5 rows.

The row indices of the duplicate entries can be found by

which(duplicated(df, by = cols))

which returns 5 for the sample dataset. For OP's full dataset, I suspect that

min(which(duplicated(df, by = cols))) > 1e5

is true, i.e., there are no duplicates within the first 1e5 rows.
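For the sample data, the first duplicate sits in row 5, so casting only rows 1 to 4 (a duplicate-free subset, analogous to the OP's first 1e5 rows) keeps the character values:

dcast(df[1:4], id ~ var, value.var = "val")

   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe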

How to create character columns even in case of duplicate entries

The OP's own approach, using fun.aggregate = function(x) paste(x[1L]), as well as applying unique() to df, only aims at removing the disturbing duplicates; the duplicates are silently dropped.
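A sketch of that first-value approach (the exact call in the OP's code may differ); only the first value encountered in each cell survives, so "South America" is dropped without any notice:

dcast(df, id ~ var, fun.aggregate = function(x) paste(x[1L]), value.var = "val")

   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe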

Alternatively, toString() can be used as the aggregation function, which keeps the duplicate entries visible:

dcast(df, id ~ var, toString, value.var = "val")

   id continent             subcontinent
1:  1   America Caribbean, South America
2:  2    Europe              West Europe

Data

library(data.table)
df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)

df

   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America
