Generate all possible pairs and count frequency in R

Question

I have a data frame of products (apple, pear, banana) sold across different locations (cities) within different categories (food and edibles).

I would like to count how many times any given pair of products appears together in the same location within any category.

This is the example dataset I'm trying to get this working on:

category <- c('food','food','food','food','food','food','edibles','edibles','edibles','edibles', 'edibles')
location <- c('houston, TX', 'houston, TX', 'las vegas, NV', 'las vegas, NV', 'philadelphia, PA', 'philadelphia, PA', 'austin, TX', 'austin, TX', 'charlotte, NC', 'charlotte, NC', 'charlotte, NC')
item <- c('apple', 'banana', 'apple', 'pear', 'apple', 'pear', 'pear', 'apple', 'apple', 'pear', 'banana')

food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)

For example, the pair "apple & banana" appears together in the "food" category in "houston, TX", and also in the "edibles" category in "charlotte, NC". Therefore, the count for the "apple & banana" pair would be 2.

My desired output is a count for each pair, like this:

count of (unordered) apple & banana: 2

count of (unordered) apple & pear: 4

Does anyone have an idea how to accomplish this? I'm relatively new to R and have been stuck on this for a while.

I'm trying to use this to calculate affinities between different items.

Additional clarification on the output: my full dataset consists of hundreds of different items. I would like to get a data frame where the first column is the pair and the second column is the count for each pair.

Recommended answer

Here is one way using tidyverse and crossprod. spread turns all items from the same category-location combination into one row, with the items as column headers and the values indicating presence (1) or absence (0). This requires that no item is duplicated within a category-location combination; otherwise you need a pre-aggregation step. crossprod then evaluates the inner product of each pair of item columns, which gives the number of co-occurrences.
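If the data did contain repeated rows, the pre-aggregation step could be as simple as dropping duplicates before reshaping. The sketch below uses base R's unique() on a small made-up data frame; dup_data and dedup_data are hypothetical names used only for illustration, and the repeated "apple" row is invented to show the effect.

```r
# Hypothetical data with a repeated row: apple is listed twice for houston, TX.
dup_data <- data.frame(
  category = c("food", "food", "food"),
  location = c("houston, TX", "houston, TX", "houston, TX"),
  item     = c("apple", "apple", "banana"),
  stringsAsFactors = FALSE
)

# Keep one row per category-location-item combination.
dedup_data <- unique(dup_data)
print(dedup_data)
```

After this step each category-location-item combination appears once, which is what the reshaping step expects.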

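library(tidyverse)
food_data %>% 
    mutate(n = 1) %>%                 # mark every listed item as present
    spread(item, n, fill = 0) %>%     # one row per category-location, one 0/1 column per item
    select(-category, -location) %>% 
    {crossprod(as.matrix(.))} %>%     # item-by-item co-occurrence counts
    `diag<-`(0)                       # zero the diagonal (an item paired with itself)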

#       apple banana pear
#apple      0      2    4
#banana     2      0    1
#pear       4      1    0
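To see why the inner product counts co-occurrences, the same matrix can be built in base R without the pipeline. This is only a sketch of the idea, not the answer's method: basket, incidence and co_occurrence are names introduced here for illustration, and the data is re-entered so the block runs on its own.

```r
# Rebuild the example data from the question.
category <- c(rep("food", 6), rep("edibles", 5))
location <- c("houston, TX", "houston, TX", "las vegas, NV", "las vegas, NV",
              "philadelphia, PA", "philadelphia, PA", "austin, TX", "austin, TX",
              "charlotte, NC", "charlotte, NC", "charlotte, NC")
item <- c("apple", "banana", "apple", "pear", "apple", "pear",
          "pear", "apple", "apple", "pear", "banana")
food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)

# One "basket" per category-location combination.
basket <- paste(food_data$category, food_data$location, sep = " | ")

# Rows are baskets, columns are items; entries are occurrence counts.
incidence <- unclass(table(basket, item = food_data$item))
# Clamp counts to 1 so a repeated item in a basket is not counted twice.
incidence[incidence > 0] <- 1

# Entry [i, j] sums, over baskets, the product of two 0/1 indicators,
# which is 1 only when both items are present in that basket.
co_occurrence <- crossprod(incidence)
diag(co_occurrence) <- 0  # the diagonal would hold each item's own basket count
print(co_occurrence)
```

Clamping the counts to 1 is one way to make the result robust to duplicated rows, at the cost of ignoring how many times an item was listed.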

To convert this to a data frame:

food_data %>% 
    mutate(n = 1) %>% 
    spread(item, n, fill = 0) %>% 
    select(-category, -location) %>% 
    {crossprod(as.matrix(.))} %>% 
    # keep only the upper triangle so each unordered pair appears once
    replace(lower.tri(., diag = TRUE), NA) %>%
    reshape2::melt(na.rm = TRUE) %>%
    unite('Pair', c('Var1', 'Var2'), sep = ", ")

#           Pair value
#4 apple, banana     2
#7   apple, pear     4
#8  banana, pear     1
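As a cross-check, the same pair counts can be produced by enumerating the pairs directly with combn in base R. This is an alternative sketch rather than the answer's approach; basket, items_by_basket, pairs and pair_counts are hypothetical names, and the data is re-entered so the block is self-contained. With hundreds of items the matrix approach above is likely to scale better, since this version builds every pair for every basket.

```r
# Rebuild the example data from the question.
category <- c(rep("food", 6), rep("edibles", 5))
location <- c("houston, TX", "houston, TX", "las vegas, NV", "las vegas, NV",
              "philadelphia, PA", "philadelphia, PA", "austin, TX", "austin, TX",
              "charlotte, NC", "charlotte, NC", "charlotte, NC")
item <- c("apple", "banana", "apple", "pear", "apple", "pear",
          "pear", "apple", "apple", "pear", "banana")
food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)

# One "basket" per category-location combination.
basket <- paste(food_data$category, food_data$location, sep = " | ")

# Unique, sorted items per basket, so each unordered pair has one spelling.
items_by_basket <- lapply(split(food_data$item, basket),
                          function(x) sort(unique(x)))

# All unordered pairs within each basket; baskets with fewer than
# two distinct items contribute no pairs.
pairs <- unlist(lapply(items_by_basket, function(x) {
  if (length(x) < 2) return(character(0))
  combn(x, 2, FUN = paste, collapse = ", ")
}))

# Tabulate how often each pair occurs across baskets.
pair_counts <- as.data.frame(table(Pair = pairs),
                             responseName = "value",
                             stringsAsFactors = FALSE)
print(pair_counts)
```

Sorting the items inside each basket is what makes "apple, pear" and "pear, apple" collapse into a single unordered pair.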
