Why is "character often preferred to factor" in data.table for key?


Question

The data.table manual says:

In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R’s internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).

Isn't factor just integer which should be easier to do counting sort than character?

Recommended answer


Isn't factor just integer which should be easier to do counting sort than character?

Yes, if you're given a factor already. But the time to create that factor can be significant and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.
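The timing experiment suggested above could be sketched roughly as follows. This is only an illustration under assumptions: the `labels` strings, the dummy column `v`, and the seed are made up here, and the actual timings will depend on the machine and on the data.table and R versions.

```r
library(data.table)

set.seed(1)
n_rows   <- 1e6   # vector length suggested in the answer
n_levels <- 1e4   # number of distinct strings suggested in the answer

# Hypothetical labels: any 1e4 distinct strings would do.
labels <- sprintf("id%05d", seq_len(n_levels))
x <- sample(labels, n_rows, replace = TRUE)   # randomly ordered character vector

# Cost of building the factor first.
t_factor <- system.time(f <- factor(x))

# Cost of keying the character column directly.
DT <- data.table(x = x, v = 1L)
t_setkey <- system.time(setkey(DT, x))

# Cost of an ad hoc by on an unkeyed copy of the same data.
DT2 <- data.table(x = x, v = 1L)
t_by <- system.time(res <- DT2[, .(total = sum(v)), by = x])

# Elapsed seconds for each approach.
print(c(factor = t_factor[["elapsed"]],
        setkey = t_setkey[["elapsed"]],
        by     = t_by[["elapsed"]]))
```

Comparing the `factor` figure against the other two shows whether converting to a factor up front pays for itself on your own data.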

agstudy's comment is correct too: character vectors (being pointers to R's cached strings) are quite similar to factors anyway. On 32-bit systems a character vector is the same size as the factor's integer vector, but the factor also has the levels attribute to store (and sometimes copy). On 64-bit systems the pointers are twice as big. On the other hand, R's string cache can be looked up directly from character vector pointers, whereas a factor needs an extra hop via levels. (The levels attribute is itself a character vector of R string cache pointers.)
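The relationship between a factor and its levels can be seen from plain R. The small vector below is made up for illustration, and the snippet only shows the R-level view (integer codes plus a levels attribute), not the internal pointer layout.

```r
x <- c("b", "a", "b", "c", "a")   # toy character vector (illustrative only)
f <- factor(x)

typeof(x)        # "character"
typeof(f)        # "integer": a factor is stored as integer codes
as.integer(f)    # the codes, which index into levels(f)
levels(f)        # the levels attribute, itself a character vector

# Getting the strings back from a factor goes through levels:
# this is the "extra hop" mentioned above.
roundtrip <- levels(f)[as.integer(f)]
identical(roundtrip, x)
```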
