dcast fails to cast character column when the data size is large
Problem description
I'm using the dcast function from the reshape2 package (library(reshape2)) to cast a simple table of three columns
df = data.table(id = 1:1e6,
                var = c('continent','subcontinent',...),
                val = c('America','Caribbean',...))
by calling dcast(df, id ~ var, value.var = 'val'), and it automatically converts the values to counts, i.e.
id continent subcontinent
 1         1            1
 2         1            1
However, if I reduce the size to 10000 rows, it correctly outputs
id continent subcontinent
 1   America    Caribbean
 2    Europe  West Europe
Is this a bug, or do I need to change the code somehow? Please help. Thanks!
The problem is not the size of the dataset itself but the occurrence of duplicate entries in the full dataset. By picking smaller subsets from the full dataset there is a chance that no duplicates are included.
help("dcast", "data.table") says:
"When variable combinations in formula doesn't identify a unique value in a cell, fun.aggregate will have to be specified, which defaults to length if unspecified."
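The effect described in the help text can be reproduced with the five-row sample data from the Data section. This is only a minimal sketch; the names small and full are chosen here for illustration.

```r
library(data.table)

# Sample data from the Data section: (id = 1, var = "subcontinent") occurs twice
df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)

# First four rows: every (id, var) pair is unique, so the values stay character
small <- dcast(df[1:4], id ~ var, value.var = "val")
class(small$subcontinent)

# All five rows: one pair is duplicated, so dcast falls back to length()
# and reports that it is doing so
full <- dcast(df, id ~ var, value.var = "val")
class(full$subcontinent)
```

The subset keeps the character values, while the full table contains counts, which matches the behaviour reported in the question.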
How to find duplicates in the full dataset
All occurrences of duplicates can be identified by
cols <- c("id", "var")
df[duplicated(df, by = cols) | duplicated(df, by = cols, fromLast = TRUE)][
  order(id)]
   id          var           val
1:  1 subcontinent     Caribbean
2:  1 subcontinent South America
Note that we are looking for duplicates in id and var as these two form the cells, i.e., rows and columns, of the reshaped result.
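As a possible alternative view of the same check, the rows per (id, var) cell can be counted directly; any cell with more than one row is what triggers the fallback to length(). This is a sketch using the sample data from the Data section; dup_cells is a name chosen here.

```r
library(data.table)

df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)
cols <- c("id", "var")

# Count rows per (id, var) cell and keep only the cells holding more than one row
dup_cells <- df[, .N, by = cols][N > 1L]
dup_cells
```

Unlike the duplicated() approach, this lists each offending cell once together with its row count, rather than the individual rows.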
Why unique() doesn't work
NB: This is why simply taking unique(df) will not work:
unique(df)
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America
does not remove any rows. Consequently,
dcast(unique(df), id ~ var, value.var = "val")
Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1
Whereas
unique(df, by = cols)
   id          var         val
1:  1    continent     America
2:  1 subcontinent   Caribbean
3:  2    continent      Europe
4:  2 subcontinent West Europe
has removed the duplicate var entry for id == 1L. Consequently,
dcast(unique(df, by = cols), id ~ var, value.var = "val")
   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe
How to find the row numbers of duplicated rows
The OP has reported that the issue appears only with the full dataset but not with a subset of the first 1e5 rows.
The row indices of the duplicate entries can be found by
which(duplicated(df, by = cols))
which returns 5 for the sample dataset. For OP's full dataset, I suspect that
min(which(duplicated(df, by = cols))) > 1e5
is true, i.e., there are no duplicates within the first 1e5 rows.
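The same check can be wrapped in a small helper that also covers the case where there are no duplicates at all (where min() on an empty vector would return Inf with a warning). This is only a sketch: first_duplicate_row, n_subset and subset_is_clean are hypothetical names introduced here, and the subset size of 4 is chosen to fit the sample data.

```r
library(data.table)

df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)
cols <- c("id", "var")

# Hypothetical helper: index of the first row that repeats an earlier
# combination of the `by` columns, or NA if no combination is repeated
first_duplicate_row <- function(dt, by) {
  idx <- which(duplicated(dt, by = by))
  if (length(idx) == 0L) NA_integer_ else min(idx)
}

first_dup <- first_duplicate_row(df, cols)

# A subset of the first n_subset rows is free of duplicates if the first
# duplicate lies beyond it (or if there is no duplicate at all)
n_subset <- 4L
subset_is_clean <- is.na(first_dup) || first_dup > n_subset
```

For the OP's data, n_subset would be the size of the subset that reshaped correctly.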
How to create character columns even in case of duplicate entries
OP's own approach of using fun.aggregate = function(x) paste(x[1L]) as well as applying unique() on df both just aim at removing any disturbing duplicates. The duplicates will be silently dropped.
Alternatively, toString() can be used as an aggregation function which shows the duplicate entries:
dcast(df, id ~ var, toString, value.var = "val")
   id continent             subcontinent
1:  1   America Caribbean, South America
2:  2    Europe              West Europe
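If the comma used by toString() could be confused with commas inside the values, a custom aggregation function with another separator may be an option. The following is a sketch, not part of the original answer: the " | " separator and the use of unique() to collapse repeated identical values are assumptions made for illustration.

```r
library(data.table)

df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)

# Collapse the distinct values of each (id, var) cell into one string;
# the result columns remain character and no value is silently dropped
wide <- dcast(
  df, id ~ var,
  fun.aggregate = function(x) paste(unique(x), collapse = " | "),
  value.var = "val"
)
wide
```

Cells with a single value are returned unchanged, so only the cells that actually contain duplicates show the separator.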
Data
library(data.table)
df <- data.table(
  id = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)
df
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America