How can I find the most common sequences (of strings) in my data using Tableau?

Problem description

Usual disclaimer: I'm very much a novice when it comes to Tableau (and R, which is my preferred data wrangling language).

Here's what I'm trying to do:

I have a dataset which has multiple variables, two of which are "time" and "genre". Here's an example of what the data looks like:


Index      Title              Date       Genre             Time
    1      Sherlock           01/01/20   Drama             21:00
    2      Peaky Blinders     01/01/20   Drama             20:00
    3      Eastenders         01/01/20   Drama             19:30
    4      BBC News           01/01/20   News              18:30
    5      Antiques Roadshow  01/01/20   Factual           18:00
    6      Peaky Blinders     02/01/20   Drama             21:00
    7      Casualty           02/01/20   Drama             20:00
    8      Eastenders         02/01/20   Drama             19:30
    9      BBC News           02/01/20   News              18:30
   10      Dragons Den        02/01/20   Entertainment     18:00

This is just a very small sample from a very large dataset, but what I'm trying to determine is: what is the most common combination/sequence of genres? For example, in the data above, the most common sequence of three would be "drama + drama + drama". The most common sequence of four would be "news + drama + drama + drama".

My data has thousands of dates (it's the BBC One broadcast schedule in case you were wondering) and I want to find out what the most common combinations / sequences of genres are (of length at least 3).

I wonder if this is too complex for Tableau and something I would need to do in R instead? Any advice would be most welcome! And as always, I'm happy to elaborate on anything that isn't clear.

Solution

Since a Tableau approach would be off topic, let's consider R:

We can use a rolling function to determine all sequences of 3. The zoo package has a rollapply function:

library(zoo)
# Slide a window of width 3 along Genre; c() returns each window as one matrix row
rollapply(data$Genre,3,c)
#     [,1]      [,2]      [,3]           
#[1,] "Drama"   "Drama"   "Drama"        
#[2,] "Drama"   "Drama"   "News"         
#[3,] "Drama"   "News"    "Factual"      
#[4,] "News"    "Factual" "Drama"        
#[5,] "Factual" "Drama"   "Drama"        
#[6,] "Drama"   "Drama"   "Drama"        
#[7,] "Drama"   "Drama"   "News"         
#[8,] "Drama"   "News"    "Entertainment"

There are plenty of ways to go from here, but I prefer dplyr:

library(dplyr)
# Count how often each distinct three-genre window occurs
rollapply(data$Genre,3,c) %>%
   as_tibble() %>%
   group_by_all() %>%
   tally()
#  V1      V2      V3                n
#  <chr>   <chr>   <chr>         <int>
#1 Drama   Drama   Drama             2
#2 Drama   Drama   News              2
#3 Drama   News    Entertainment     1
#4 Drama   News    Factual           1
#5 Factual Drama   Drama             1
#6 News    Factual Drama             1
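
Since the question asks for sequences of at least 3, it may help to make the window width a parameter. The following is a minimal base R sketch of the same sliding-window idea, not the zoo/dplyr code above; `count_sequences` is a hypothetical helper name, and the `genre` vector simply repeats the Genre column of the sample data.

```r
# Hypothetical helper (not part of zoo or dplyr): count every run of `n`
# consecutive values in `x` and return the counts, most frequent first.
count_sequences <- function(x, n) {
  # Fewer than n values means there are no complete windows.
  if (length(x) < n) return(table(character(0)))
  starts <- seq_len(length(x) - n + 1)
  # Collapse each window into one label such as "Drama + Drama + News".
  windows <- vapply(
    starts,
    function(i) paste(x[i:(i + n - 1)], collapse = " + "),
    character(1)
  )
  sort(table(windows), decreasing = TRUE)
}

# Genre column of the sample data, in the order the rows are listed.
genre <- c("Drama", "Drama", "Drama", "News", "Factual",
           "Drama", "Drama", "Drama", "News", "Entertainment")

count_sequences(genre, 3)  # same eight windows as the rollapply() call
count_sequences(genre, 4)  # "Drama + Drama + Drama + News" occurs twice
```

Calling the helper in a loop over several widths, for example 3 to 6, would give one frequency table per sequence length.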

Data:

data <- structure(list(Index = 1:10, Title = c("Sherlock", "Peaky Blinders", 
"Eastenders", "BBC News", "Antiques Roadshow", "Peaky Blinders", 
"Casualty", "Eastenders", "BBC News", "Dragons Den"), Date = c("01/01/20", 
"01/01/20", "01/01/20", "01/01/20", "01/01/20", "02/01/20", "02/01/20", 
"02/01/20", "02/01/20", "02/01/20"), Genre = c("Drama", "Drama", 
"Drama", "News", "Factual", "Drama", "Drama", "Drama", "News", 
"Entertainment"), Time = c("21:00", "20:00", "19:30", "18:30", 
"18:00", "21:00", "20:00", "19:30", "18:30", "18:00")), class = "data.frame", row.names = c(NA, 
-10L))
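
One thing to watch with a single rolling window over the whole column: some windows span two dates (the last programme of one day followed by the first of the next), and the rows here are listed latest-first within each day. If sequences should run chronologically and stay within a day, which is what the question's "news + drama + drama + drama" example suggests, a sketch along the following lines may be closer. It is base R only; `day_windows` and `count_sequences_by_day` are hypothetical helper names, `schedule` is a stand-in for the sample data, and the code assumes zero-padded "HH:MM" times and a broadcast day that does not cross midnight.

```r
# Stand-in for the sample data above (same Date, Time and Genre values).
schedule <- data.frame(
  Date  = rep(c("01/01/20", "02/01/20"), each = 5),
  Time  = rep(c("21:00", "20:00", "19:30", "18:30", "18:00"), times = 2),
  Genre = c("Drama", "Drama", "Drama", "News", "Factual",
            "Drama", "Drama", "Drama", "News", "Entertainment"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all runs of `n` consecutive genres within one day.
day_windows <- function(genres, n) {
  if (length(genres) < n) return(character(0))
  starts <- seq_len(length(genres) - n + 1)
  vapply(
    starts,
    function(i) paste(genres[i:(i + n - 1)], collapse = " + "),
    character(1)
  )
}

# Hypothetical helper: count n-genre sequences without crossing dates.
count_sequences_by_day <- function(df, n) {
  # Assumption: zero-padded "HH:MM" strings sort correctly as text.
  df <- df[order(df$Time), ]
  # split() keeps the time order within each date.
  per_day <- lapply(split(df$Genre, df$Date), day_windows, n = n)
  sort(table(unlist(per_day, use.names = FALSE)), decreasing = TRUE)
}

count_sequences_by_day(schedule, 4)  # "News + Drama + Drama + Drama" occurs twice
```

On the sample data this yields four windows of length 4 instead of seven, because windows that would straddle the two dates are never formed.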
