Generate all possible pairs and count frequency in R
Problem description
I have a data frame of products (apple, pear, banana) sold across different locations (cities) within different categories (food and edibles).
I would like to count how many times any given pair of products appears together in the same category and location.
This is the example dataset I'm trying to make this work on:
category <- c('food','food','food','food','food','food','edibles','edibles','edibles','edibles', 'edibles')
location <- c('houston, TX', 'houston, TX', 'las vegas, NV', 'las vegas, NV', 'philadelphia, PA', 'philadelphia, PA', 'austin, TX', 'austin, TX', 'charlotte, NC', 'charlotte, NC', 'charlotte, NC')
item <- c('apple', 'banana', 'apple', 'pear', 'apple', 'pear', 'pear', 'apple', 'apple', 'pear', 'banana')
food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)
For example, the pair "apple & banana" appeared together in the "food" category in "houston, TX", and also in the "edibles" category in "charlotte, NC". Therefore, the count for the "apple & banana" pair would be 2.
My desired output is a count of pairs like this:
(unordered) count of apple & banana
2
(unordered) count of apple & pear
4
Does anyone have an idea how to accomplish this? I'm relatively new to R and have been confused by this for a while.
I'm trying to use this to calculate affinities between different items.
Additional clarification on output: My full dataset consists of hundreds of different items. I would like to get a data frame where the first column is the pair and the second column is the count for each pair.
Recommended answer
Here is one way using tidyverse and crossprod. spread turns all items from the same category-location combination into one row, with the items as column headers and the values indicating presence (this requires that no item is duplicated within a category-location combination; otherwise you need a pre-aggregation step). crossprod then evaluates the inner product of each pair of item columns, which gives the number of co-occurrences.
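library(tidyverse)

food_data %>%
  mutate(n = 1) %>%                  # presence flag for each item
  spread(item, n, fill = 0) %>%      # one row per category-location, one column per item
  select(-category, -location) %>%
  {crossprod(as.matrix(.))} %>%      # item-by-item co-occurrence counts
  `diag<-`(0)                        # zero out each item's count with itself

#        apple banana pear
# apple      0      2    4
# banana     2      0    1
# pear       4      1    0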
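To make it clearer what crossprod is operating on, the following is a minimal base R sketch of the same idea. It is not part of the original answer: the names basket, incidence and co are illustrative, and table() stands in for spread(). Converting the counts to 0/1 also acts as the pre-aggregation step mentioned above, so a repeated item within a category-location combination is only counted once.

```r
# Rebuild the example data from the question.
category <- c('food','food','food','food','food','food',
              'edibles','edibles','edibles','edibles','edibles')
location <- c('houston, TX', 'houston, TX', 'las vegas, NV', 'las vegas, NV',
              'philadelphia, PA', 'philadelphia, PA', 'austin, TX', 'austin, TX',
              'charlotte, NC', 'charlotte, NC', 'charlotte, NC')
item <- c('apple', 'banana', 'apple', 'pear', 'apple', 'pear',
          'pear', 'apple', 'apple', 'pear', 'banana')
food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)

# One "basket" per category-location combination (illustrative name).
basket <- interaction(food_data$category, food_data$location, drop = TRUE)

# Incidence matrix: rows are baskets, columns are items, entries are 0/1.
# The "> 0" step collapses any duplicated items within a basket.
incidence <- (unclass(table(basket, food_data$item)) > 0) * 1

# t(incidence) %*% incidence: entry [i, j] counts the baskets that contain both items.
co <- crossprod(incidence)
diag(co) <- 0   # an item paired with itself is not of interest
co
```

Each off-diagonal entry of co is the sum over baskets of the product of two 0/1 columns, which is 1 only when both items are present in that basket.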
To convert this to a data frame:
food_data %>%
  mutate(n = 1) %>%
  spread(item, n, fill = 0) %>%
  select(-category, -location) %>%
  {crossprod(as.matrix(.))} %>%
  replace(lower.tri(., diag = TRUE), NA) %>%   # keep each unordered pair once
  reshape2::melt(na.rm = TRUE) %>%             # long format: Var1, Var2, value
  unite('Pair', c('Var1', 'Var2'), sep = ", ")

#            Pair value
# 4 apple, banana     2
# 7   apple, pear     4
# 8  banana, pear     1
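As an alternative sketch in base R (an assumption on my part, not the method from the answer above), the pairs can also be enumerated directly with combn within each category-location combination and then tabulated. The helper pairs_in_basket and the column names Pair and Count are hypothetical names chosen for illustration. Unlike the crossprod approach, this only lists pairs that actually occur together at least once.

```r
# Rebuild the example data from the question.
category <- c('food','food','food','food','food','food',
              'edibles','edibles','edibles','edibles','edibles')
location <- c('houston, TX', 'houston, TX', 'las vegas, NV', 'las vegas, NV',
              'philadelphia, PA', 'philadelphia, PA', 'austin, TX', 'austin, TX',
              'charlotte, NC', 'charlotte, NC', 'charlotte, NC')
item <- c('apple', 'banana', 'apple', 'pear', 'apple', 'pear',
          'pear', 'apple', 'apple', 'pear', 'banana')
food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)

# Hypothetical helper: all unordered pairs of distinct items in one basket.
pairs_in_basket <- function(items) {
  items <- sort(unique(items))                  # sorting makes "a, b" and "b, a" identical
  if (length(items) < 2) return(character(0))   # combn() needs at least two items
  combn(items, 2, FUN = paste, collapse = ", ")
}

# Split items into one vector per category-location combination.
baskets <- split(food_data$item,
                 list(food_data$category, food_data$location),
                 drop = TRUE)

# Tabulate how often each pair occurs across all baskets.
all_pairs <- unlist(lapply(baskets, pairs_in_basket), use.names = FALSE)
pair_counts <- as.data.frame(table(all_pairs), stringsAsFactors = FALSE)
names(pair_counts) <- c("Pair", "Count")
pair_counts
```

With hundreds of items, a basket holding k distinct items contributes k * (k - 1) / 2 pairs, so this stays cheap when individual baskets are small, whereas the crossprod approach builds a full item-by-item matrix.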