How can I circumvent the 2^31 error thrown by the table() function?


Question

I tried my best to find a solution on Stack Overflow, but unfortunately I couldn't find a suitable question, so I have to ask one myself.

I'm working with a data set containing session IDs and topics. I wanted to find out how many items of specific topics have been purchased together. Thankfully, a Stack Overflow member had a great idea: using a combination of the table() function and the crossprod() function.

topicPairs <- crossprod(table(as.data.frame(transactions)))

You can take a look at it here: How can I count how many items have been in one session together?

For the topics (or genres), this approach worked really well, and the final matrix was really small in terms of storage usage.

However, now I want to find out how many artists have been purchased together in different sessions. So I just replaced the genres (I have 360 of them) with the artists (I have 35727) and applied this 'table-crossprod combination'. Unfortunately, R throws the following error message:

attempt to make a table with >= 2^31 elements          

I also understand what happened: the table() function generates one entry per session and genre. Since I only have 360 different genres, this is no problem, because the number of sessions multiplied by the number of genres is less than 2^31. On the other hand, I have 35727 different artists. If I multiply this number by the number of sessions, I exceed 2^31 elements.
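As a rough check of this arithmetic, the sketch below computes the largest number of sessions that still fits under the 2^31 limit for each case. The actual number of sessions is not stated here, so the code only derives the thresholds; the variable names are made up for illustration.

```r
# Dense tables are limited to fewer than 2^31 elements (rows * columns).
limit <- 2^31

n_genres  <- 360    # number of distinct genres, from the question
n_artists <- 35727  # number of distinct artists, from the question

# Largest session count for which sessions * categories stays below 2^31
max_sessions_genres  <- floor((limit - 1) / n_genres)
max_sessions_artists <- floor((limit - 1) / n_artists)

max_sessions_genres   # 5965232
max_sessions_artists  # 60108
```

So with 35727 artists, any data set with more than 60108 sessions triggers the error, while the genre table has room for almost six million sessions.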

This is really a pity, since the solution is so smart and easy, and it worked really well. So I want to ask whether there is a way to circumvent this problem. Sure, my data set is quite big, but there are people using much bigger data sets.

Perhaps I have to split the data set into smaller subsets and merge them in a final step. But this is not that easy, since some artists appear, for example, in subset 1 but not in subset 2. Therefore, I cannot simply add the matrices elementwise.
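For illustration only, here is a minimal sketch of how this splitting idea might be made to work. It assumes that all rows of a session stay in the same subset and that every subset uses the same fixed set of factor levels, so that the partial matrices share dimensions and can be added. The toy data and all variable names are hypothetical.

```r
# Toy data standing in for the real transactions (hypothetical)
transactions <- data.frame(
  sessionID = c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6),
  topic = c("rock", "house", "country", "rock", "r'n'b",
            "pop", "classic", "house", "rock", "country")
)

# Fix the full set of levels up front so every subset yields the same columns
all_topics <- sort(unique(transactions$topic))

# Split by session so that no session is divided between subsets
chunks <- split(transactions, transactions$sessionID <= 3)

partial <- lapply(chunks, function(ch) {
  ch$topic <- factor(ch$topic, levels = all_topics)
  crossprod(table(ch$sessionID, ch$topic))
})

# Identical dimensions in every chunk, so elementwise addition is valid
total <- Reduce(`+`, partial)

# Same result as the one-shot dense computation on the full data
full <- crossprod(table(transactions$sessionID, transactions$topic))
```

Each partial table is still dense, so on the real data every subset would have to contain few enough sessions to stay below the 2^31 limit.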

It would be awesome if you could provide a solution for this problem, since it drives me crazy being that close to the perfect solution.

Thank you very much in advance!

Recommended answer

When your result matrix is likely to be sparse, that is, it has a high percentage of zeros, it is worth using sparse matrices to save space, if possible.

So for your data:

sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop", "classic", "house", "rock", "country")
transactions <- cbind(sessionID, topic)

You can use xtabs to return a sparse matrix (instead of the dense matrix returned by table), and use the Matrix package to compute the cross product of it, which retains the sparsity.

tab <- xtabs(~ sessionID + topic, data=transactions, sparse=TRUE)
Matrix::crossprod(tab)
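As a quick sanity check, the sketch below rebuilds the same toy data as a data frame (an assumption made here so that the columns keep their own types) and compares the sparse result with the dense table-crossprod approach. The names pairs_sparse and pairs_dense are made up for illustration.

```r
library(Matrix)

transactions <- data.frame(
  sessionID = c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6),
  topic = c("rock", "house", "country", "rock", "r'n'b",
            "pop", "classic", "house", "rock", "country")
)

# Sparse session-by-topic table; only the non-zero counts are stored
tab <- xtabs(~ sessionID + topic, data = transactions, sparse = TRUE)
pairs_sparse <- Matrix::crossprod(tab)

# Dense reference, only feasible because the toy data is tiny
pairs_dense <- crossprod(table(transactions))

# Both approaches should give the same co-occurrence counts
all(as.matrix(pairs_sparse) == pairs_dense)
```

On the toy data the two results agree. For example, "house" and "country" occur together in sessions 2 and 6, so that entry is 2 in both matrices.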
