Why does rbindlist not respect column names?


Problem description


I just discovered this bug, only to find that some people are calling it a "feature". It makes rbindlist behave differently from do.call("rbind", l), because rbind WILL respect column names. Further, the documentation does not mention this entirely unexpected behavior. Is this really intentional?

Code example:

> library(data.table)
> DT1 <- data.table(a=1, b=2)
> DT2 <- data.table(b=3, a=4)
> DT1
   a b
1: 1 2
> DT2
   b a
1: 3 4

I would expect rbind'ing these to produce columns with a = 1,4 and b = 2,3. That is what I get with both rbind.data.table and rbind.data.frame, though rbind.data.table produces a warning.

> rbind(DT1, DT2)
   a b
1: 1 2
2: 4 3
Warning message:
In data.table::.rbind.data.table(...) :
  Argument 2 has names in a different order. Columns will be bound by name for consistency with base. You can drop names (by using an unnamed list) and the columns will then be joined by position, or set use.names=FALSE. Alternatively, explicitly setting use.names to TRUE will remove this warning.
> rbind(as.data.frame(DT1), as.data.frame(DT2))
  a b
1 1 2
2 4 3
> do.call('rbind', list(DT1, DT2))
   a b
1: 1 2
2: 4 3
Warning message:
In data.table::.rbind.data.table(...) :
  Argument 2 has names in a different order. Columns will be bound by name for consistency with base. You can drop names (by using an unnamed list) and the columns will then be joined by position, or set use.names=FALSE. Alternatively, explicitly setting use.names to TRUE will remove this warning.

rbindlist, however, is happy to silently corrupt the data:

> rbindlist(list(DT1, DT2))
   a b
1: 1 2
2: 3 4

Solution

This feature is now implemented in commit 1266 of v1.9.3. From NEWS:

o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented 
   entirely in C. Closes #5249    
  -> use.names by default is FALSE for backwards compatibility (doesn't bind by 
     names by default)
  -> rbind(...) now just calls rbindlist() internally, except that 'use.names' 
     is TRUE by default, for compatibility with base (and backwards compatibility).
  -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
  -> At least one item of the input list has to have non-null column names.
  -> Duplicate columns are bound in the order of occurrence, like base.
  -> Attributes that might exist in individual items would be lost in the bound result.
  -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
  -> And incredibly fast ;).
  -> Documentation updated in much detail. Closes DR #5158.
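
The coercion item in the NEWS entry can be checked with a small sketch. The table names (ints, dbls, chrs) are made up for illustration, and use.names is passed explicitly so the result does not depend on the default of the installed data.table version:

```r
library(data.table)

# Three one-column tables whose 'id' column has a different type each time.
ints <- data.table(id = 1L)       # integer
dbls <- data.table(id = 2.5)      # double
chrs <- data.table(id = "three")  # character

# integer + double: the bound column should end up as double ("numeric").
int_dbl <- rbindlist(list(ints, dbls), use.names = TRUE)

# Adding a character column should push the whole column up to character.
all_three <- rbindlist(list(ints, dbls, chrs), use.names = TRUE)

print(class(int_dbl$id))
print(class(all_three$id))
```

If the coercion works as the NEWS entry describes, the first class is numeric and the second is character.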

With this, you can set use.names=TRUE to bind by names. It's set to FALSE by default for backwards compatibility. Alternatively, you can use rbind(...), where use.names defaults to TRUE, again for backwards compatibility.
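
Applied to the tables from the question, the difference looks roughly like this. It is a sketch that assumes a data.table version with the use.names argument; the names by_name, by_pos, aligned and by_reorder are only for illustration. The last part shows one possible workaround for versions that lack use.names: put the columns of each table in the same order with setcolorder() before binding.

```r
library(data.table)

DT1 <- data.table(a = 1, b = 2)
DT2 <- data.table(b = 3, a = 4)

# Bind by name: 'a' collects 1 and 4, 'b' collects 2 and 3.
by_name <- rbindlist(list(DT1, DT2), use.names = TRUE)

# Bind by position: the second row of DT2 lands in the wrong columns.
by_pos <- rbindlist(list(DT1, DT2), use.names = FALSE)

# Workaround sketch: reorder columns on copies so the originals are not
# modified by reference, then bind. Once the orders match, binding by
# position and binding by name give the same result.
aligned <- lapply(list(DT1, DT2), function(dt) {
  dt <- copy(dt)
  setcolorder(dt, names(DT1))
  dt
})
by_reorder <- rbindlist(aligned)

print(by_name)
print(by_pos)
print(by_reorder)
```

The copy() call matters because setcolorder() changes its argument by reference; without it, DT2 itself would be reordered.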

See this post for more examples and this post for benchmarks.

Examples:

1) Just set use.names=TRUE

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=1, x=2)

rbindlist(list(DT1,DT2), use.names=TRUE, fill=FALSE)
#    x y
# 1: 1 2
# 2: 2 1

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(z=2, y=1)

# Errors with fill=FALSE: the column sets differ, so these tables can only be bound by name with fill=TRUE
rbindlist(list(DT1, DT2), use.names=TRUE, fill=FALSE)
# Error in rbindlist(list(DT1, DT2), use.names = TRUE, fill = FALSE) : 
#   Answer requires 3 columns whereas one or more item(s) in the input 
#   list has only 2 columns. ...


2) Also binds duplicate column names in the order of occurrence:

DT1 <- data.table(x=1, x=2, y=10, y=20, y=30)
DT2 <- data.table(y=-10, x=-2, y=-20, x=-1, y=-30)

rbindlist(list(DT1,DT2), use.names=TRUE)

#     x  x   y   y   y
# 1:  1  2  10  20  30
# 2: -2 -1 -10 -20 -30


3) Use fill=TRUE if you want to bind by names and fill missing columns with NA:

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=2, z=-1)

rbindlist(list(DT1, DT2), fill=TRUE)
#     x y  z
# 1:  1 2 NA
# 2: NA 2 -1
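
If you don't know in advance whether the column sets match, one option is to try the strict by-name bind first and fall back to filling. This is only a sketch: bind_strict_or_fill is a hypothetical helper, not part of data.table, and it relies solely on rbindlist() raising an error when fill=FALSE and the column names differ, as in example 1.

```r
library(data.table)

DT1 <- data.table(x = 1, y = 2)
DT2 <- data.table(y = 2, z = -1)

# Hypothetical helper (not a data.table function): bind by name strictly,
# and only if that fails retry with fill = TRUE so missing columns become NA.
bind_strict_or_fill <- function(tables) {
  tryCatch(
    rbindlist(tables, use.names = TRUE, fill = FALSE),
    error = function(e) rbindlist(tables, use.names = TRUE, fill = TRUE)
  )
}

res <- bind_strict_or_fill(list(DT1, DT2))
print(res)
```

Silently falling back hides the mismatch, so in real code you may prefer to emit a warning() inside the error handler before retrying.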

HTH
