Why is "character often preferred to factor" as a key in data.table?

Question

From the data.table manual:

In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R’s internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).

Isn't a factor just an integer vector, which should be easier to counting-sort than a character vector?

Recommended answer

Isn't a factor just an integer vector, which should be easier to counting-sort than a character vector?

Yes, if you're given a factor already. But the time to create that factor can be significant and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.
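The comparison suggested above could be set up roughly as follows. This is only a sketch: the variable names (`pool`, `x`, `dt`, `dt_by`) and the `id00001`-style strings are made up for illustration, and timings will vary with machine, R version and data.table version, so no expected numbers are shown.

```r
library(data.table)

set.seed(42)
n_rows   <- 1e6   # length of the character vector
n_levels <- 1e4   # number of distinct strings

# Randomly ordered character vector with many duplicates (illustrative data)
pool <- sprintf("id%05d", seq_len(n_levels))
x    <- sample(pool, n_rows, replace = TRUE)

# 1. Cost of building the factor in the first place
t_factor <- system.time(f <- factor(x))

# 2. Cost of keying a data.table directly on the character column
dt <- data.table(x = x, v = 1L)
t_setkey <- system.time(setkey(dt, x))

# 3. Cost of an ad hoc by on the unkeyed character column
dt_by <- data.table(x = x, v = 1L)
t_by <- system.time(res <- dt_by[, .(total = sum(v)), by = x])

# Elapsed seconds for each approach
print(c(factor = t_factor[["elapsed"]],
        setkey = t_setkey[["elapsed"]],
        by     = t_by[["elapsed"]]))
```

The point of the comparison is that `factor()` has to be paid for before any integer-based sort can start, whereas `setkey` and `by` work on the character column as it is.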

agstudy's comment is correct too: character vectors (being pointers to R's cached strings) are quite similar to factors anyway. On 32-bit systems a character vector is the same size as a factor's integer vector, but the factor also has a levels attribute to store (and sometimes copy). On 64-bit systems the pointers are twice as big. On the other hand, R's string cache can be looked up directly from character vector pointers, whereas a factor needs an extra hop via levels. (The levels attribute is itself a character vector of R string cache pointers.)
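The structural similarity described here can be inspected from base R. The sketch below uses made-up data (`pool`, `x`, `f` are illustrative names) and only shows what is observable at the R level: the integer codes, the `levels` attribute, the extra lookup needed to get back to the strings, and the sizes reported on the current platform.

```r
set.seed(1)
pool <- sprintf("id%05d", 1:1e4)             # illustrative distinct strings
x    <- sample(pool, 1e6, replace = TRUE)    # character vector with duplicates
f    <- factor(x)

# A factor is an integer vector of codes plus a character "levels" attribute
codes <- unclass(f)
stopifnot(is.integer(codes), is.character(levels(f)))

# Getting back to the strings takes the extra hop through levels
roundtrip <- levels(f)[as.integer(f)]
stopifnot(identical(roundtrip, x))

# Pointer size on this build: 4 bytes on 32-bit, 8 bytes on 64-bit
print(.Machine$sizeof.pointer)

# Approximate memory use of each representation on this platform
print(object.size(x), units = "MB")
print(object.size(f), units = "MB")
```

Note that `object.size()` is only an estimate, so the printed sizes are a rough guide to the per-element cost rather than an exact accounting.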
