Calculate row similarity percentage pairwise and add it as a new column
Problem description
I have a data frame like this sample. I would like to find similar rows (not duplicates) and calculate their similarity pairwise. I found this solution, but I would like to keep all my columns and add the similarity percentage as a new variable. My aim is to find the records with the highest similarity percentage. How could I do it?
Sample data set
df <- tibble::tribble(
~date, ~user_id, ~Station_id, ~location_id, ~ind_id, ~start_hour, ~start_minute, ~start_second, ~end_hour, ~end_minute, ~end_second, ~duration_min,
20191015, 19900234, 242, 2, "ac", 7, 25, 0, 7, 30, 59, 6,
20191015, 19900234, 242, 2, "ac", 7, 31, 0, 7, 32, 59, 2,
20191015, 19900234, 242, 2, "ac", 7, 33, 0, 7, 38, 59, 6,
20191015, 19900234, 242, 2, "ac", 7, 39, 0, 7, 40, 59, 2,
20191015, 19900234, 242, 2, "ac", 7, 41, 0, 7, 43, 59, 3,
20191015, 19900234, 242, 2, "ac", 7, 44, 0, 7, 45, 59, 2,
20191015, 19900234, 242, 2, "ac", 7, 47, 0, 7, 59, 59, 13,
20191015, 19900234, 242, 2, "ad", 7, 47, 0, 7, 59, 59, 13,
20191015, 19900234, 242, 2, "ac", 8, 5, 0, 8, 6, 59, 2,
20191015, 19900234, 242, 2, "ad", 8, 5, 0, 8, 6, 59, 2,
20191015, 19900234, 242, 2, "ac", 8, 7, 0, 8, 8, 59, 2,
20191015, 19900234, 242, 2, "ad", 8, 7, 0, 8, 8, 59, 2,
20191015, 19900234, 242, 2, "ac", 16, 26, 0, 16, 55, 59, 30,
20191015, 19900234, 242, 2, "ad", 16, 26, 0, 16, 55, 59, 30,
20191015, 19900234, 242, 2, "ad", 17, 5, 0, 17, 6, 59, 2,
20191015, 19900234, 242, 2, "ac", 17, 5, 0, 17, 23, 59, 19,
20191015, 19900234, 242, 2, "ad", 17, 7, 0, 17, 15, 59, 9,
20191015, 19900234, 242, 2, "ad", 17, 16, 0, 17, 22, 59, 7,
20191015, 19900234, 264, 2, "ac", 17, 24, 0, 17, 35, 59, 12,
20191015, 19900234, 264, 2, "ad", 17, 25, 0, 17, 35, 59, 11,
20191016, 19900234, 242, 1, "ac", 7, 12, 0, 7, 14, 59, 3,
20191016, 19900234, 242, 1, "ad", 7, 13, 0, 7, 13, 59, 1,
20191016, 19900234, 242, 1, "ac", 17, 45, 0, 17, 49, 59, 5,
20191016, 19900234, 242, 1, "ad", 17, 46, 0, 17, 48, 59, 3,
20191016, 19900234, 242, 2, "ad", 7, 14, 0, 8, 0, 59, 47,
20191016, 19900234, 242, 2, "ac", 7, 15, 0, 8, 0, 59, 47
)
Function for comparing rows
row_cf <- function(x, y, df){
sum(df[x,] == df[y,])/ncol(df)
}
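As a quick check of what row_cf returns, here is a minimal sketch on a made-up three-row data frame. The `toy` data and its column names are assumptions for illustration only and are not part of the sample data set. The function counts the columns in which two rows agree and divides by the total number of columns.

```r
# Hypothetical toy data, used only to illustrate row_cf
toy <- data.frame(
  station = c(242, 242, 264),
  ind_id  = c("ac", "ad", "ac"),
  hour    = c(7, 7, 7),
  minute  = c(25, 25, 30),
  stringsAsFactors = FALSE
)

# Same function as in the question: share of columns with equal values
row_cf <- function(x, y, df){
  sum(df[x,] == df[y,])/ncol(df)
}

s12 <- row_cf(1, 2, toy)  # rows agree on station, hour, minute: 3 of 4 columns
s13 <- row_cf(1, 3, toy)  # rows agree on ind_id, hour: 2 of 4 columns
s11 <- row_cf(1, 1, toy)  # a row compared with itself agrees on every column
```

Note that if any compared cell is NA, the sum is NA as well; using sum(..., na.rm = TRUE) inside the function is one way to count only the known matches.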
Function output
library(dplyr)

# 1) Create all possible row combinations
# 2) Rename
# 3) Run through each row
# 4) Calculate similarity
expand.grid(1:nrow(df), 1:nrow(df)) %>%
  rename(row_1 = Var1, row_2 = Var2) %>%
  rowwise() %>%
  mutate(similarity = row_cf(row_1, row_2, df))
# A tibble: 676 x 3
row_1 row_2 similarity
<int> <int> <dbl>
1 1 1 1
2 2 1 0.75
3 3 1 0.833
4 4 1 0.75
5 5 1 0.75
6 6 1 0.75
7 7 1 0.75
8 8 1 0.667
9 9 1 0.583
10 10 1 0.5
Edit: I would like to find similar rows in the data like here
Using your "function output", call it sim. Eliminate the self-comparisons and then keep the row with the maximum similarity within each row_1 group:
sim = sim %>%
filter(row_1 != row_2) %>%
group_by(row_1) %>%
slice(which.max(similarity))
Then you can add these to your original data:
df %>% mutate(row_1 = 1:n()) %>%
  left_join(sim, by = "row_1")
The row_2 column gives the row number of the most similar row, and similarity gives its similarity score. (You may want to improve these column names.)
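Putting the steps together, the following is a minimal end-to-end sketch on a small made-up data frame rather than the full sample. The `toy` data and the output column names `most_similar_row` and `similarity_pct` are hypothetical choices for illustration; the similarity is multiplied by 100 so it reads as a percentage.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble::tibble(
  station = c(242, 242, 264, 264),
  ind_id  = c("ac", "ad", "ac", "ad"),
  hour    = c(7, 7, 17, 17),
  minute  = c(25, 25, 24, 25)
)

# Same comparison function as in the question
row_cf <- function(x, y, df){
  sum(df[x,] == df[y,])/ncol(df)
}

# Similarity for every ordered pair of rows
sim <- expand.grid(1:nrow(toy), 1:nrow(toy)) %>%
  rename(row_1 = Var1, row_2 = Var2) %>%
  rowwise() %>%
  mutate(similarity = row_cf(row_1, row_2, toy)) %>%
  ungroup()

# Best non-self match per row; which.max keeps only the first row on ties
best <- sim %>%
  filter(row_1 != row_2) %>%
  group_by(row_1) %>%
  slice(which.max(similarity)) %>%
  ungroup() %>%
  # Hypothetical, more descriptive column names
  rename(most_similar_row = row_2, similarity_pct = similarity) %>%
  mutate(similarity_pct = 100 * similarity_pct)

# Attach the best match to the original columns
result <- toy %>%
  mutate(row_1 = 1:n()) %>%
  left_join(best, by = "row_1")

result
```

Because which.max returns only the first maximum, a row that is equally similar to several others is paired with just one of them. Replacing the slice step with filter(similarity == max(similarity)) would keep all tied matches, at the cost of more than one output row per original row.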