My group by doesn't appear to be working in disk frames

Question

I ran a group by on a large dataset (>20GB) and it doesn't appear to be working quite right

Here's my code:

mydf[, .(value = n_distinct(list_of_id, na.rm = T)),
     by = .(week),
     keep = c("list_of_id", "week")]

It returns these warnings:

Warning messages:
1: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading
2: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading
3: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading
4: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading
5: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading
6: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading
7: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading
8: In serialize(data, node$con) : 'package:MLmetrics' may not be available when loading

I had initially loaded the library, but then ran remove.packages(MLmetrics) before running this code. Additionally, I checked conflicted::conflict_scout() and no conflicts involving the MLmetrics package show up.

When I run this code

> mydf %>% 
+   filter(week == "2012-01-02")

it gives me this output:

         week  value
1: 2012-01-02    483
2: 2012-01-02  61233

I'm concerned that something went wrong when the data was grouped, since it didn't produce a single row for each distinct value of week. Both columns are stored as character.

Answer

Author of {disk.frame} here.

The issue is that currently, {disk.frame} does the group-by within each chunk; it does not do a global group-by the way plain dplyr syntax would. That is why you see two rows for the same week: each row is the distinct count computed within a different chunk.

So you have to summarise the chunk-level results again to get the answer you want; a sketch of that manual second pass follows. For now, though, I suggest sticking with the dplyr syntax.
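
Here is a minimal sketch of that manual two-stage approach, assuming mydf is a disk.frame with the character columns week and list_of_id from the question. Note that per-chunk distinct counts cannot simply be summed, because the same id may appear in more than one chunk, so the first pass keeps the unique ids themselves:

library(disk.frame)
library(data.table)

# First pass runs once per chunk: collect the unique ids seen for each week.
chunk_res <- mydf[, .(ids = list(unique(list_of_id))),
                  by = .(week),
                  keep = c("list_of_id", "week")]

# Second pass runs in RAM: pool the per-chunk id sets and count distinct
# values across all chunks for each week.
res <- chunk_res[, .(value = uniqueN(unlist(ids), na.rm = TRUE)), by = week]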

As @Waldi pointed out, {disk.frame}'s dplyr syntax works fine, and support for data.table is currently lacking, so for now you can only achieve what you want with the dplyr syntax.
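
For instance, a sketch using the dplyr verbs, again assuming mydf is a disk.frame with columns week and list_of_id, and assuming n_distinct is among the aggregation functions that {disk.frame}'s group_by framework supports:

library(disk.frame)
library(dplyr)

setup_disk.frame()  # spin up workers for the chunk-wise computation

result <- mydf %>%
  srckeep(c("week", "list_of_id")) %>%           # only read the two columns needed
  group_by(week) %>%                             # disk.frame handles the two-stage
  summarise(value = n_distinct(list_of_id)) %>%  # (per-chunk, then global) aggregation
  collect()                                      # bring the final result into RAM

Unlike the data.table syntax above, this should return exactly one row per week.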

{disk.frame} needs to implement https://github.com/xiaodaigh/disk.frame/issues/239 before it will work for data.table.

Please DM me if anyone/organization would like to fund the development of this feature.
