Data.table, logical comparison and encoding bugs/errors in non-English environment


Question

data.table gives a warning even when the encodings are not mixed and are known. The only time a merge gives no warning is when the encoding is unknown on both tables. This doesn't seem right; logical comparison seems to behave differently and ignores the encoding.

I have two questions. First, why does data.table behave this way when both encodings are known and identical? Judging by the warning text, I guess it's a bug (albeit a small one).

The last merge, which fails, is perhaps the desired behavior, but shouldn't the logical comparison then fail as well? That brings me to the second question: what is the difference between a data.table join and a logical comparison, given that they give different results in my last merge?

Logical comparison seems more robust in the face of encoding issues.

Code and reproducible output are below, followed by sessionInfo().

library("data.table")

d.tst <- data.table(Nr = c("ÅÄÖ", "ÄÖR"))
d.tst2 <- data.table(Nr2 = c("ÅÄÖ", "ÄÖR"),
                     Dat = c(1, 2))

Encoding(d.tst$Nr)
# [1] "latin1" "latin1"
Encoding(d.tst2$Nr2)
# [1] "latin1" "latin1"

d.tst[1]$Nr == d.tst2[1]$Nr2
# [1] TRUE
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")




Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown
encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.



d.tst$Nr <- iconv(d.tst$Nr, "LATIN1", "UTF-8")
d.tst2$Nr2 <- iconv(d.tst2$Nr2, "LATIN1", "UTF-8")

Encoding(d.tst$Nr)
# [1] "UTF-8" "UTF-8"
Encoding(d.tst2$Nr2)
# [1] "UTF-8" "UTF-8"

a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")




Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,: A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown
encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.



d.tst$Nr <- iconv(d.tst$Nr, "UTF-8", "cp1252")
d.tst2$Nr2 <- iconv(d.tst2$Nr2, "UTF-8", "cp1252")

Encoding(d.tst$Nr)
# [1] "unknown" "unknown"
Encoding(d.tst2$Nr2)
# [1] "unknown" "unknown"

a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

# Here we change the encoding on only one data.table

d.tst$Nr <- iconv(d.tst$Nr, "cp1252", "UTF-8")

#Check encoding
Encoding(d.tst$Nr)
# [1] "UTF-8" "UTF-8"
Encoding(d.tst2$Nr2)
# [1] "unknown" "unknown"

# Logical comparison
d.tst[1]$Nr == d.tst2[1]$Nr2
# [1] TRUE

# This merge fails completely, not just a warning, even if logic says they are the same
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")




Warning message: In bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown
encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.
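The warning's statement that data.table "compares the bytes" can be illustrated with base R alone. The following is a minimal sketch, not taken from the original post: the variable names `x_utf8` and `x_latin1` are hypothetical, and the string is written with `\u` escapes so the example does not depend on the encoding of the source file. It shows that the same text has different byte sequences in latin1 and UTF-8, while `==` still reports equality because R translates strings to a common encoding before comparing them.

```r
# Same text, two encodings (names are illustrative, not from the post)
x_utf8   <- enc2utf8("\u00c5\u00c4\u00d6")      # "ÅÄÖ" stored as UTF-8
x_latin1 <- iconv(x_utf8, "UTF-8", "latin1")    # "ÅÄÖ" stored as latin1

# The raw bytes differ: UTF-8 needs two bytes per character here,
# latin1 needs one
print(charToRaw(x_utf8))
print(charToRaw(x_latin1))

# A byte-wise comparison therefore sees two different values ...
bytes_equal <- identical(charToRaw(x_utf8), charToRaw(x_latin1))
print(bytes_equal)      # FALSE

# ... whereas `==` converts to a common encoding first
logical_equal <- x_utf8 == x_latin1
print(logical_equal)    # TRUE
```

This is consistent with the behavior described in the question: a join that matches on raw bytes can miss rows that the logical comparison treats as equal.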



sessionInfo() 

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Swedish_Sweden.1252  LC_CTYPE=Swedish_Sweden.1252    LC_MONETARY=Swedish_Sweden.1252 LC_NUMERIC=C                    
[5] LC_TIME=Swedish_Sweden.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 RODBC_1.3-13    

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.1.2       assertthat_0.1 DBI_0.4-1      tools_3.3.1    tibble_1.1     Rcpp_0.12.5    chron_2.3-47


Recommended answer

As of data.table version 1.9.8 this should be fixed.

For example:

# This merge fails completely, not just a warning, even if logic says they are the same
a <- merge(d.tst, d.tst2, all.x=TRUE, by.x = "Nr", by.y = "Nr2")

The above code failed for me (given my system settings) in 1.9.6. As of 1.9.8 it works as it should.

So this should be resolved.
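If upgrading is not possible, one workaround that may help is to convert both join columns to the same encoding before merging, so that a byte-wise comparison sees identical values. The sketch below is an assumption on my part rather than something from the answer: it uses base R's `enc2utf8()`, and the table names `left` and `right` and the sample data are hypothetical.

```r
library(data.table)

# Hypothetical sample data: the same key stored in two different encodings
s     <- enc2utf8("\u00c5\u00c4\u00d6")                  # "ÅÄÖ" as UTF-8
left  <- data.table(Nr  = iconv(s, "UTF-8", "latin1"))   # key marked latin1
right <- data.table(Nr2 = s, Dat = 1)                    # key marked UTF-8

# Normalize both join columns to UTF-8 so their bytes agree
left[,  Nr  := enc2utf8(Nr)]
right[, Nr2 := enc2utf8(Nr2)]

res <- merge(left, right, all.x = TRUE, by.x = "Nr", by.y = "Nr2")
print(res)
```

On an older version such as 1.9.6 the warning may still be printed, since it is triggered by any known encoding in a join column, but the keys should then match.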
