How can I find most common sequences (of strings) in my data using Tableau?
Problem description
Usual disclaimer: I'm very much a novice when it comes to Tableau (and R, which is my preferred data wrangling language).
Here's what I'm trying to do:
I have a dataset which has multiple variables, two of which are "time" and "genre". Here's an example of what the data looks like:
Index  Title              Date      Genre          Time
1      Sherlock           01/01/20  Drama          21:00
2      Peaky Blinders     01/01/20  Drama          20:00
3      Eastenders         01/01/20  Drama          19:30
4      BBC News           01/01/20  News           18:30
5      Antiques Roadshow  01/01/20  Factual        18:00
6      Peaky Blinders     02/01/20  Drama          21:00
7      Casualty           02/01/20  Drama          20:00
8      Eastenders         02/01/20  Drama          19:30
9      BBC News           02/01/20  News           18:30
10     Dragons Den        02/01/20  Entertainment  18:00
This is just a very small sample from a very large dataset, but what I'm trying to determine is: what is the most common combination/sequence of genres? For example, in the data above, the most common sequence of three would be "drama + drama + drama". The most common sequence of four would be "news + drama + drama + drama".
My data has thousands of dates (it's the BBC One broadcast schedule, in case you were wondering) and I want to find out what the most common combinations/sequences of genres are (of length 3 or more).
I wonder if this is too complex for Tableau and something I would need to do in R instead? Any advice would be most welcome! And as always, I'm happy to elaborate on anything that isn't clear.
Since a Tableau approach would be off topic, let's consider R:
We can use a rolling function to determine all sequences of 3. The zoo package has a rollapply function:
library(zoo)
rollapply(data$Genre, 3, c)
#     [,1]      [,2]      [,3]
#[1,] "Drama"   "Drama"   "Drama"
#[2,] "Drama"   "Drama"   "News"
#[3,] "Drama"   "News"    "Factual"
#[4,] "News"    "Factual" "Drama"
#[5,] "Factual" "Drama"   "Drama"
#[6,] "Drama"   "Drama"   "Drama"
#[7,] "Drama"   "Drama"   "News"
#[8,] "Drama"   "News"    "Entertainment"
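For readers who would rather not load a package, the same rolling windows can be built with plain base R indexing. This is only a sketch and not part of the original answer; the names genre, n, windows and counts are illustrative, and the genre vector is copied from the sample data so the block runs on its own.

```r
# Genre column from the sample data, in the order the rows are listed.
genre <- c("Drama", "Drama", "Drama", "News", "Factual",
           "Drama", "Drama", "Drama", "News", "Entertainment")

n <- 3  # window length (illustrative name)

# One window starts at every position that still has n elements after it.
starts <- seq_len(max(0, length(genre) - n + 1))

# Collapse each window into a single string such as "Drama + Drama + News".
windows <- vapply(
  starts,
  function(i) paste(genre[i:(i + n - 1)], collapse = " + "),
  character(1)
)

# Tabulate the windows and put the most frequent ones first.
counts <- sort(table(windows), decreasing = TRUE)
print(counts)
```

Changing n to 4 or more gives longer sequences with no other changes to the code.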
There are plenty of ways to go from here, but I prefer dplyr:
library(dplyr)
rollapply(data$Genre, 3, c) %>%
  as_tibble() %>%
  group_by_all() %>%
  tally()
#  V1      V2      V3                n
#  <chr>   <chr>   <chr>         <int>
#1 Drama   Drama   Drama             2
#2 Drama   Drama   News              2
#3 Drama   News    Entertainment     1
#4 Drama   News    Factual           1
#5 Factual Drama   Drama             1
#6 News    Factual Drama             1
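One thing to watch: the rolling window above runs over the rows exactly as they are listed, so some windows span two dates (rows 4 and 5 of the first output), and within a day the genres appear latest first. If each date should be treated on its own and in broadcast order, which is an assumption about what the question intends, a base R sketch could look like the following. The names schedule, day_windows and count_sequences are hypothetical, and the code assumes Time is a zero-padded HH:MM string that sorts correctly as text (programmes running past midnight would need extra handling).

```r
# Sample rows from the question (only the columns needed here).
schedule <- data.frame(
  Date  = rep(c("01/01/20", "02/01/20"), each = 5),
  Genre = c("Drama", "Drama", "Drama", "News", "Factual",
            "Drama", "Drama", "Drama", "News", "Entertainment"),
  Time  = rep(c("21:00", "20:00", "19:30", "18:30", "18:00"), times = 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every run of n consecutive genres within one day.
day_windows <- function(genres, n) {
  starts <- seq_len(max(0, length(genres) - n + 1))
  vapply(
    starts,
    function(i) paste(genres[i:(i + n - 1)], collapse = " + "),
    character(1)
  )
}

# Hypothetical helper: count sequences of length n without crossing dates.
count_sequences <- function(df, n) {
  # Put each day's programmes in chronological order.
  df <- df[order(df$Date, df$Time), ]
  # Build the windows separately for each date, then pool them.
  per_day <- lapply(split(df$Genre, df$Date), day_windows, n = n)
  sort(table(unlist(per_day)), decreasing = TRUE)
}

print(count_sequences(schedule, 3))
print(count_sequences(schedule, 4))
```

On the sample data this makes "News + Drama + Drama + Drama" the most frequent sequence of four, in line with the example given in the question.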
Data:
data <- structure(list(Index = 1:10, Title = c("Sherlock", "Peaky Blinders",
"Eastenders", "BBC News", "Antiques Roadshow", "Peaky Blinders",
"Casualty", "Eastenders", "BBC News", "Dragons Den"), Date = c("01/01/20",
"01/01/20", "01/01/20", "01/01/20", "01/01/20", "02/01/20", "02/01/20",
"02/01/20", "02/01/20", "02/01/20"), Genre = c("Drama", "Drama",
"Drama", "News", "Factual", "Drama", "Drama", "Drama", "News",
"Entertainment"), Time = c("21:00", "20:00", "19:30", "18:30",
"18:00", "21:00", "20:00", "19:30", "18:30", "18:00")), class = "data.frame", row.names = c(NA,
-10L))