My group by doesn't appear to be working in disk frames

Problem Description

I ran a group by on a large dataset (>20GB) and it doesn't appear to be working quite right.

Here's my code:

mydf[, .(value = n_distinct(list_of_id, na.rm = T)),
                      by = .(week),
                      keep = c("list_of_id", "week")
                      ] 

It returned this error:

Warning messages:
1: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
2: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
3: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
4: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
5: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
6: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
7: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading
8: In serialize(data, node$con) :
  'package:MLmetrics' may not be available when loading

I had initially loaded the library but then I ran remove.packages(MLmetrics) before running this code. Additionally, I checked conflicted::conflict_scout and there aren't any conflicts that show up with the package MLmetrics.
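
One possible explanation for the warnings (an assumption on my part, not something the message itself confirms): disk.frame runs its work on a cluster of {future} background workers, and serialize(data, node$con) is how objects are shipped to them. If those workers were started while MLmetrics was still attached, they can keep referencing its search-path entry even after the package is removed. A sketch of resetting the workers:

library(disk.frame)
future::plan(future::sequential)  # tear down the existing worker cluster
setup_disk.frame()                # start fresh workers with the current search path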

When I run this code

> mydf %>% 
+   filter(week == "2012-01-02")

it gives me this output:

         week    value 
1: 2012-01-02      483     
2: 2012-01-02     61233  

I'm concerned that something went wrong when it was grouping the data, since it didn't create a distinct group for each value of week. Both columns are stored as character data types.

Recommended Answer

Author of {disk.frame} here.

The issue is that, currently, {disk.frame} does the group by within each chunk; it does not do the group-by globally the way the same dplyr syntax would. That is why the same week can show up once per chunk in your output.

So you have to summarise it again to achieve what you want; for now, I suggest sticking with the dplyr syntax. A sketch of the two-stage "summarise again" idea follows.
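
A minimal sketch of that workaround with the data.table syntax. Assumptions: the per-chunk results row-bind into an ordinary in-memory data.table, and the column names follow the question. Since distinct counts can't simply be summed across chunks, the first pass keeps the distinct (week, list_of_id) pairs and the second pass counts them:

library(data.table)
library(disk.frame)

# Pass 1 (per chunk): collapse each chunk to its distinct
# (week, list_of_id) pairs. disk.frame runs this expression on every
# chunk and row-binds the results, so a week can still repeat here.
partial <- mydf[, .(list_of_id = unique(list_of_id)), by = .(week),
                keep = c("list_of_id", "week")]

# Pass 2 (in memory): de-duplicate across chunks and count the
# distinct ids per week.
result <- partial[, .(value = uniqueN(list_of_id, na.rm = TRUE)), by = week]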

As @Waldi pointed out, {disk.frame}'s dplyr syntax works fine; support for data.table is currently lacking, so for now you can only achieve what you want with the dplyr syntax.
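
A minimal sketch of the dplyr route. srckeep() limits the columns read from disk; whether n_distinct() is supported in a one-stage group_by depends on your disk.frame version, so treat this as an outline rather than a guaranteed recipe:

library(dplyr)
library(disk.frame)

# group_by/summarise over the whole disk.frame; collect() brings the
# final, globally grouped result into memory, so each week appears once.
result <- mydf %>%
  srckeep(c("list_of_id", "week")) %>%   # only read the needed columns
  group_by(week) %>%
  summarise(value = n_distinct(list_of_id, na.rm = TRUE)) %>%
  collect()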

{disk.frame} needs to implement https://github.com/xiaodaigh/disk.frame/issues/239 before it will work for data.table.

Please DM me if anyone/organization would like to fund the development of this feature.
