Calculate similarity within a dataframe across specific rows (R)

Question

I have a data frame that looks something like this:

df <- data.frame(
  "index" = 1:10,
  "title" = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News", "Antiques Roadshow",
              "Eastenders", "BBC News", "Casualty", "Dragons Den", "Peaky Blinders"),
  "date"  = c("01/01/20", "01/01/20", "01/01/20", "01/01/20", "01/01/20",
              "02/01/20", "02/01/20", "02/01/20", "02/01/20", "02/01/20")
)

The output looks like this:

Index  Title              Date
1      Sherlock           01/01/20
2      Peaky Blinders     01/01/20
3      Eastenders         01/01/20
4      BBC News           01/01/20
5      Antiques Roadshow  01/01/20
6      Eastenders         02/01/20
7      BBC News           02/01/20
8      Casualty           02/01/20
9      Dragons Den        02/01/20
10     Peaky Blinders     02/01/20

I want to determine how many titles appear on more than one date. In the example above, "BBC News", "Peaky Blinders" and "Eastenders" all appear on both 01/01/20 and 02/01/20. The similarity between the two dates is therefore 60% (3 of the 5 titles are the same on both dates).

It's probably also worth mentioning that the actual data frame is much larger: it has 120 titles per day and spans some 700 days. I need to compare the titles of each date with those of the previous date and then calculate their similarity. To be clear, I need the similarity of 01/01/20 with 02/01/20, 02/01/20 with 03/01/20, 03/01/20 with 04/01/20, and so on.

Does anyone have any idea how I might go about doing this? My eventual aim is to use Tableau to visualise similarity/difference over time, but I fear that such a calculation would be too complicated for that software, so I'll have to add it to the data itself somehow.

Recommended answer

Here is another possibility. You can write a simple function that calculates the similarity (or any other index) between groups. Then split the data frame by date into a list and lapply the custom function over the list's elements (the final result will be a list).

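# Parse the dd/mm/yy strings as dates so the groups sort chronologically
df$date <- as.Date(df$date, format = "%d/%m/%y")

# One vector of titles per date
s <- split(df$title, df$date)

# Count date i's titles that also appear on the previous date,
# divided by the number of titles on the previous date
calc_similar <- function(i) {
  sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
}

# Skip the first date (nothing to compare it with) and name results by date
setNames(lapply(seq_along(s)[-1], calc_similar), names(s)[-1])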

Output

$`2020-01-02`
[1] 0.6
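
Since the end goal is a Tableau visualisation, a flat table may be easier to export than a list. The following is only a sketch of one way to do that in base R, not part of the original answer. The helper name `similarity_by_date`, its `date_format` argument, and the extra `jaccard` column (shared titles divided by all distinct titles across both dates) are illustrative assumptions. It also assumes that dates are in dd/mm/yy form and that "previous date" means the previous date present in the data, which is not necessarily the previous calendar day.

```r
# Hypothetical helper (not from the original answer): returns one row per
# date with its similarity to the previous date found in the data.
similarity_by_date <- function(df, date_format = "%d/%m/%y") {
  # Parse dates so groups sort chronologically rather than alphabetically
  dates <- as.Date(as.character(df$date), format = date_format)

  # One vector of unique titles per date
  s <- lapply(split(as.character(df$title), dates), unique)

  # Indices of every date that has a predecessor
  idx <- seq_along(s)[-1]

  data.frame(
    date          = as.Date(names(s)[idx]),
    previous_date = as.Date(names(s)[idx - 1]),
    # Same ratio as the answer: shared titles / titles on the previous date
    similarity    = vapply(idx, function(i) {
      sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
    }, numeric(1)),
    # Assumed alternative index: shared titles / all distinct titles on both dates
    jaccard       = vapply(idx, function(i) {
      length(intersect(s[[i]], s[[i - 1]])) / length(union(s[[i]], s[[i - 1]]))
    }, numeric(1))
  )
}

# Example data from the question
df <- data.frame(
  index = 1:10,
  title = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News", "Antiques Roadshow",
            "Eastenders", "BBC News", "Casualty", "Dragons Den", "Peaky Blinders"),
  date  = c(rep("01/01/20", 5), rep("02/01/20", 5))
)

result <- similarity_by_date(df)
print(result)
```

For the example data this gives a single row for 2020-01-02 with a similarity of 0.6 and a Jaccard index of 3/7. The result could then be written out with `write.csv(result, "similarity.csv", row.names = FALSE)` (the file name is arbitrary) and loaded into Tableau as an ordinary data source.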
