按对计算行相似百分比并将其添加为新的列 [英] Calculate row similarity percentage pair wise and add it as a new colum

查看：46 发布时间：2021/4/29 18:54:39 r datatable tidyverse similarity

本文介绍了按对计算行相似百分比并将其添加为新的列的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个类似此示例的日期框架，我想查找相似的行(而不是重复的行)并按明智的方式计算相似度.我发现

解决方案

使用函数输出"，将其命名为 sim .消除自我比较，然后将最大相似性行按row_1分组:

  sim = sim％>％filter(row_1！= row_2)％&％;％group_by(row_1)％>％切片(which.max(相似性))

然后您可以将它们添加到原始数据中:

  df％>％mutate(row_1 = 1:n())％>％left_join(sim)

row_2 列给出了最相似的行的行号，而 similarity 给出了其相似度得分.(您可能需要改进这些列名称.)

I have a date frame like this sample, I would like to find similar rows (not duplicate) and calculate similarity per wise. I find this solution but i would like to keep all my columns and add similarity percentage as a new variable. My aim is to find records with highest similarity percentage. How could I do it ?

sample data set

df <- tibble::tribble(
     ~date, ~user_id, ~Station_id, ~location_id, ~ind_id, ~start_hour, ~start_minute, ~start_second, ~end_hour, ~end_minute, ~end_second, ~duration_min,
  20191015, 19900234,         242,            2,    "ac",           7,            25,             0,         7,          30,          59,             6,
  20191015, 19900234,         242,            2,    "ac",           7,            31,             0,         7,          32,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           7,            33,             0,         7,          38,          59,             6,
  20191015, 19900234,         242,            2,    "ac",           7,            39,             0,         7,          40,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           7,            41,             0,         7,          43,          59,             3,
  20191015, 19900234,         242,            2,    "ac",           7,            44,             0,         7,          45,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           7,            47,             0,         7,          59,          59,            13,
  20191015, 19900234,         242,            2,    "ad",           7,            47,             0,         7,          59,          59,            13,
  20191015, 19900234,         242,            2,    "ac",           8,             5,             0,         8,           6,          59,             2,
  20191015, 19900234,         242,            2,    "ad",           8,             5,             0,         8,           6,          59,             2,
  20191015, 19900234,         242,            2,    "ac",           8,             7,             0,         8,           8,          59,             2,
  20191015, 19900234,         242,            2,    "ad",           8,             7,             0,         8,           8,          59,             2,
  20191015, 19900234,         242,            2,    "ac",          16,            26,             0,        16,          55,          59,            30,
  20191015, 19900234,         242,            2,    "ad",          16,            26,             0,        16,          55,          59,            30,
  20191015, 19900234,         242,            2,    "ad",          17,             5,             0,        17,           6,          59,             2,
  20191015, 19900234,         242,            2,    "ac",          17,             5,             0,        17,          23,          59,            19,
  20191015, 19900234,         242,            2,    "ad",          17,             7,             0,        17,          15,          59,             9,
  20191015, 19900234,         242,            2,    "ad",          17,            16,             0,        17,          22,          59,             7,
  20191015, 19900234,         264,            2,    "ac",          17,            24,             0,        17,          35,          59,            12,
  20191015, 19900234,         264,            2,    "ad",          17,            25,             0,        17,          35,          59,            11,
  20191016, 19900234,         242,            1,    "ac",           7,            12,             0,         7,          14,          59,             3,
  20191016, 19900234,         242,            1,    "ad",           7,            13,             0,         7,          13,          59,             1,
  20191016, 19900234,         242,            1,    "ac",          17,            45,             0,        17,          49,          59,             5,
  20191016, 19900234,         242,            1,    "ad",          17,            46,             0,        17,          48,          59,             3,
  20191016, 19900234,         242,            2,    "ad",           7,            14,             0,         8,           0,          59,            47,
  20191016, 19900234,         242,            2,    "ac",           7,            15,             0,         8,           0,          59,            47
  )

Function for comparing rows

row_cf <- function(x, y, df){
  sum(df[x,] == df[y,])/ncol(df)
}

Function output

# 1) Create all possible row combinations
# 2) Rename 
# 3) Run through each row
# 4) Calculate similarity

expand.grid(1:nrow(df), 1:nrow(df)) %>% 
  rename(row_1 = Var1, row_2 = Var2) %>% 
  rowwise() %>% 
  mutate(similarity = row_cf(row_1, row_2, df))


# A tibble: 676 x 3
   row_1 row_2 similarity
   <int> <int>      <dbl>
 1     1     1      1    
 2     2     1      0.75 
 3     3     1      0.833
 4     4     1      0.75 
 5     5     1      0.75 
 6     6     1      0.75 
 7     7     1      0.75 
 8     8     1      0.667
 9     9     1      0.583
10    10     1      0.5

Edit: I would like to find similar rows in the data like here

解决方案

Using your "function output", call it sim. Eliminate the self-comparisons and then keep the max similarity row grouped by row_1:

sim = sim %>% 
  filter(row_1 != row_2) %>%
  group_by(row_1) %>% 
  slice(which.max(similarity))

Then you can add these to your original data:

df %>% mutate(row_1 = 1:n()) %>%
  left_join(sim)

The row_2 column gives the row number of the most similar row, and similarity gives its similarity score. (You may want to improve these column names.)

这篇关于按对计算行相似百分比并将其添加为新的列的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

按对计算行相似百分比并将其添加为新的列 [英] Calculate row similarity percentage pair wise and add it as a new colum

问题描述

sample data set

Function for comparing rows

Function output

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

按对计算行相似百分比并将其添加为新的列 [英] Calculate row similarity percentage pair wise and add it as a new colum

问题描述

sample data set

Function for comparing rows

Function output

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭