How can I count how many items have been in one session together?
Problem description
I did my best to search Stack Overflow for a solution, but unfortunately I couldn't find a suitable question, so I am asking my own.
I'm working with a data set containing session IDs and topics. Imagine it looks like this:
sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop", "classic", "house", "rock", "country")
transactions <- cbind(sessionID, topic)
transactions
Now I want to find out how often items of different topics have been in a session together. In the end, I want a matrix that shows how often each topic has shared a session with each of the other topics. The final result should look like the following:
topics <- sort(unique(topic))
topicPairs <- matrix(NA, nrow = length(topics), ncol = length(topics))
colnames(topicPairs) <- topics
rownames(topicPairs) <- topics
topicPairs["house", "country"] <- 2
topicPairs["country", "house"] <- 2
topicPairs["r'n'b", "pop"] <- 1
topicPairs["pop", "r'n'b"] <- 1
topicPairs["rock", "house"] <- 1
topicPairs["house", "rock"] <- 1
topicPairs["rock", "country"] <- 1
topicPairs["country", "rock"] <- 1
topicPairs["house", "house"] <- 2
topicPairs
For example, row "house", column "country" should equal 2, since "house" has been together with "country" in sessions 2 and 6.
On the main diagonal I would expect the total number of sessions in which a topic appeared. Here, row "house", column "house" equals 2, since "house" has been in two sessions ... but I'm not sure about this.
It would be great if your solution didn't include loops, since my data set is quite big. I would therefore prefer functions from the tidyverse (dplyr, tidyr, etc.), perhaps a combination of group_by and the spread function from the tidyr package.
I'm really looking forward to your answers. Thank you very much in advance!
Kind regards!
Recommended answer
If you don't mind performing a join (of transactions to itself) via the dplyr package, the following should work:
library(dplyr)
library(tibble)
library(tidyr)
# ...
# Your existing code that created `transactions`.
# ...
# Convert transactions to a dataframe for transformation.
transactions <- as.data.frame(transactions)
result <- transactions %>%
  # Create pairings of topics by session.
  inner_join(transactions, by = "sessionID", suffix = c(".r", ".c")) %>%
  # "Pivot" the pairings, such that each topic within `topic.c` gets its own
  # column; and then aggregate the pairings by count.
  # Only `topic.r` identifies a row, so that there is one row per topic.
  pivot_wider(id_cols = topic.r,
              names_from = topic.c,
              values_from = sessionID,
              values_fn = length,
              names_sort = TRUE) %>%
  # Sort appropriately, to align the main diagonal.
  arrange(topic.r) %>%
  # Convert to matrix form, with topics as row names.
  column_to_rownames(var = "topic.r") %>%
  as.matrix()
# View result.
result
Here is the printout of my result:
        classic country house pop r'n'b rock
classic       1      NA    NA  NA    NA   NA
country      NA       2     2  NA    NA    1
house        NA       2     2  NA    NA    1
pop          NA      NA    NA   1     1   NA
r'n'b        NA      NA    NA   1     1   NA
rock         NA       1     1  NA    NA    3
Update
The suggestion by Ben is more elegant (not to mention cleverer), and requires only the following
# ...
# Your existing code that created `transactions`.
# ...
# Compute the results.
result <- crossprod(table(as.data.frame(transactions)))
# Substitute NAs for 0s, if you so desire.
result <- ifelse(result == 0, NA, result)
to achieve the same result. I cannot vouch for the relative performance of either solution on large datasets.
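For readers wondering why the crossprod one-liner works, here is a minimal base R sketch of the same idea, written out step by step. It rebuilds the toy data from the question; the names incidence and cooccurrence are illustrative and not part of either answer. The extra `> 0` step is an assumption on my part: it makes a topic that is repeated within one session count only once, which the one-liner above does not do.

```r
# Same toy data as in the question.
sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop",
           "classic", "house", "rock", "country")

# Session-by-topic incidence matrix: 1 if the topic occurs in the session,
# 0 otherwise. unclass() turns the table into a plain integer matrix that
# keeps its dimnames.
incidence <- 1 * (unclass(table(sessionID, topic)) > 0)

# crossprod(x) computes t(x) %*% x. Entry [a, b] sums, over all sessions,
# the product of the two indicators, i.e. the number of sessions that
# contain both topic a and topic b. The diagonal is the number of sessions
# per topic.
cooccurrence <- crossprod(incidence)

cooccurrence["house", "country"]  # 2 (sessions 2 and 6)
cooccurrence["rock", "rock"]      # 3 (sessions 1, 3 and 6)
```

This also answers the question about the main diagonal: with an incidence matrix of 0s and 1s, each diagonal entry is the number of sessions in which that topic appeared.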