How data.table sorts strings when setting key


Question

Yesterday I spent some time hunting a bug in my code and found that the data.table package sorts strings a bit differently from base R. Is this normal behavior, and what is the most efficient way (one that keeps the benefits of data.table) to reproduce the results obtained with the base order function? Here is a toy reproducible example:

library(data.table)
options(stringsAsFactors = FALSE)

d <- data.frame(cn=c("USA","Ubuntu","Uzbekistan"))
d[order(d$cn),,drop=F]

#          cn
#2     Ubuntu
#1        USA
#3 Uzbekistan

dt <- data.table(d)
setkey(dt, cn)
dt

#           cn
#1:        USA
#2:     Ubuntu
#3: Uzbekistan

options(stringsAsFactors = default.stringsAsFactors())

OS Windows 7


Recommended answer

Update, March 2014

There's been some debate about this one. As of v1.9.2 we've settled, for now, on setkey sorting using the C locale; e.g., all capital letters come before all lower case letters, regardless of the user's locale. This was a change made in v1.8.8 which we had intended to reverse but have stuck with for now.
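The "capitals before lower case" rule can be seen without data.table. The sketch below uses base R's sort with method = "radix", which is documented to sort character vectors in the C locale. It is only a base R analogue for illustration, not the routine setkey itself uses, and the example vector is made up.

```r
x <- c("banana", "Apple", "apple", "Banana")

# method = "radix" sorts character vectors in the C locale,
# whatever the session's LC_COLLATE setting is.
c_sorted <- sort(x, method = "radix")
print(c_sorted)
# [1] "Apple"  "Banana" "apple"  "banana"

# The default method follows the session's collation rules,
# so this result can differ from one machine to another.
print(sort(x))
```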

Consider save()-ing a keyed table in your locale and a colleague load()-ing it in a different locale. If the key had been sorted in locale order, their joins to that table might no longer work correctly. We have to think a bit more carefully if setkey is to allow locale ordering again, probably by saving the locale name along with the "sorted" attribute, so that data.table can at least detect whether the current locale differs from the one that ran setkey.
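The idea of recording the locale next to the sorted data can be sketched in plain R. Everything here is hypothetical: tag_collation, same_collation and the "collate_locale" attribute are names invented for this illustration, and nothing of the sort is claimed to exist in data.table.

```r
# Hypothetical helper: remember which collation locale sorted the data.
tag_collation <- function(x) {
  attr(x, "collate_locale") <- Sys.getlocale("LC_COLLATE")
  x
}

# Hypothetical helper: TRUE only if the current session uses the same
# collation locale as the one recorded on the object.
same_collation <- function(x) {
  identical(attr(x, "collate_locale"), Sys.getlocale("LC_COLLATE"))
}

sorted <- tag_collation(sort(c("USA", "Ubuntu", "Uzbekistan")))
print(same_collation(sorted))

# Simulate an object that was sorted under some other locale.
foreign <- sorted
attr(foreign, "collate_locale") <- "some-other-locale"
print(same_collation(foreign))
```

A check like this only detects a mismatch; re-sorting (or refusing to join) would still be up to the caller.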

It's also for speed reasons, as sorting according to locale is much slower than sorting in the C locale. That said, we could implement locale sorting as efficiently as possible, and allowing it optionally would be ideal.

Hence, this is now a feature request and further comments are very welcome.

FR#4842: setkey to sort using the session's locale, not the C locale
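Until such an option exists, one way to get the base ordering for a data.table is to compute the permutation with base order() on the plain column and subset by it. This is a minimal sketch rather than a recommendation from the data.table authors; idx and dt_locale are illustrative names, and the result is an ordinary unkeyed table, so it gives up what setkey provides.

```r
library(data.table)

d <- data.frame(cn = c("USA", "Ubuntu", "Uzbekistan"),
                stringsAsFactors = FALSE)
dt <- data.table(d)

# Compute the permutation with base order() on a plain character vector,
# so the session's collation rules apply.
idx <- order(d$cn)

# Subset by row number; no key is set on the result.
dt_locale <- dt[idx]
print(dt_locale)
print(key(dt_locale))
```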

Nice catch! The call to setkey in turn calls setkeyv, which calls fastorder to "order" the columns/entries, which in turn calls chorder.

chorder in turn calls a C function, Ccountingcharacter.c. Here, I suppose, the problem comes down to the "locale".

Let's see which locale is in use on my Mac.

Sys.getlocale()
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Now let's see how order sorts it:

x <- c("USA", "Ubuntu", "Uzbekistan")
order(x)
# [1] 2 1 3

Now, let's change the locale to "C".

Sys.setlocale("LC_ALL", "C")
# [1] "C/C/C/C/C/en_US.UTF-8"

order(x)
# [1] 1 2 3
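
Sys.setlocale changes the collation for the rest of the session, so it may be worth restoring the previous setting once the comparison is done. A minimal sketch, where order_in_c_locale is a made-up helper name and only LC_COLLATE is switched:

```r
# Hypothetical helper: order x under C collation, then restore the
# session's previous LC_COLLATE setting on exit.
order_in_c_locale <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  order(x)
}

x <- c("USA", "Ubuntu", "Uzbekistan")
before <- Sys.getlocale("LC_COLLATE")

c_order <- order_in_c_locale(x)
print(c_order)
# [1] 1 2 3

# The session's collation setting is unchanged afterwards.
print(identical(Sys.getlocale("LC_COLLATE"), before))
```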

From ?order:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

From ?Comparison:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z....

So, basically, order under the "C" locale gives the same order as data.table's setkey. My guess is that the C function called by chorder always operates in the C locale, which compares ASCII values, and in ASCII "S" comes before "b".
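The ASCII claim is easy to check. "USA" and "Ubuntu" share their first character, so a C-locale comparison is decided by the second one; the sketch below looks up those code points with base R's utf8ToInt (the variable name codes is just for illustration).

```r
# Code point of the second character of each string.
codes <- vapply(c("USA", "Ubuntu"),
                function(s) utf8ToInt(s)[2],
                integer(1))
print(codes)
#    USA Ubuntu
#     83     98

# "S" (83) is smaller than "b" (98), so "USA" sorts first in the C locale.
print(codes[["USA"]] < codes[["Ubuntu"]])
# [1] TRUE
```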

It's probably important to bring this to @MatthewDowle's attention (if he's not already aware of it). So I'd suggest filing this as a bug here (just to be sure).
