Jaccard similarity in stringdist package to match words in character string


Problem description

I would like to use the Jaccard similarity in the stringdist function to determine the similarity of bags of words. From what I can tell, the Jaccard method only compares the characters (q-grams of letters) within each string.

library(stringdist)

c <- c('cat', 'dog', 'person')
d <- c('cat', 'dog', 'ufo')

# Element-wise Jaccard distance on character 2-grams
stringdist(c, d, method='jaccard', q=2)
# [1] 0 0 1

So we see here that it compares the two vectors element-wise: 'cat' with 'cat', 'dog' with 'dog', and 'person' with 'ufo'. The first two pairs are identical (distance 0), and the last pair shares no 2-grams (distance 1).

I also tried combining the words into one long text string. The following comes closer to what I need, but it is still calculating 1 - (number of shared 2-grams / total number of unique 2-grams) over characters, not words:

f <- 'cat dog person'
g <- 'cat dog ufo'
stringdist(f, g, method='jaccard', q=2)
# [1] 0.5625
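
To see where 0.5625 comes from, the character-level calculation can be reproduced by hand. The following is a minimal base R sketch, not the stringdist implementation; the helper names `char_qgrams` and `jaccard_dist` are made up for illustration. It assumes q-grams are treated as a set (duplicates ignored), which matches the formula quoted above.

```r
# Hypothetical helper: the set of character q-grams of one string.
char_qgrams <- function(s, q = 2) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Hypothetical helper: Jaccard distance between two sets,
# 1 - |intersection| / |union|.
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

f <- 'cat dog person'
g <- 'cat dog ufo'

qf <- char_qgrams(f)   # 13 unique 2-grams, spaces included
qg <- char_qgrams(g)   # 10 unique 2-grams, spaces included

n_shared <- length(intersect(qf, qg))  # 7
n_union  <- length(union(qf, qg))      # 16
jaccard_dist(qf, qg)                   # 1 - 7/16 = 0.5625
```

Note that 2-grams containing a space, such as "t " and " d", count as shared, so the result reflects character overlap across the whole string and not overlap between words.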

How would I get it to calculate similarity by words instead?

Recommended answer

You can start by tokenizing each sentence and hashing the resulting list of words, which turns the sentences into lists of integers, and then use seq_dist() to calculate the distance.

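library(hashr)
library(stringdist)

f <- 'cat dog person'
g <- 'cat dog ufo'

# Split on whitespace, hash each word to an integer, then compare the
# integer sequences; with q = 2 the q-grams are pairs of adjacent words.
seq_dist(hash(strsplit(f, "\\s+")), hash(strsplit(g, "\\s+")), method = "jaccard", q = 2)
# [1] 0.6666667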
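
With q = 2 the call above compares pairs of adjacent words: ('cat', 'dog') is shared, while ('dog', 'person') and ('dog', 'ufo') are not, so the distance is 1 - 1/3. If a pure bag-of-words comparison is wanted, where word order does not matter, the Jaccard distance can also be computed directly on the sets of words. The following is a minimal base R sketch under that assumption; `word_set` and `word_jaccard_dist` are hypothetical helper names, not part of stringdist or hashr.

```r
# Hypothetical helper: the set of whitespace-separated words in a string.
# Assumes no leading whitespace (which would produce an empty first token).
word_set <- function(s) {
  unique(strsplit(s, "\\s+")[[1]])
}

# Hypothetical helper: word-level Jaccard distance,
# 1 - |shared words| / |all distinct words|.
word_jaccard_dist <- function(a, b) {
  wa <- word_set(a)
  wb <- word_set(b)
  1 - length(intersect(wa, wb)) / length(union(wa, wb))
}

f <- 'cat dog person'
g <- 'cat dog ufo'

# Shared: cat, dog (2); union: cat, dog, person, ufo (4)
word_dist <- word_jaccard_dist(f, g)
word_dist  # 0.5
```

Calling seq_dist() with q = 1 instead of q = 2 should correspond to this word-level calculation, since each hashed word is then its own q-gram; that equivalence is stated here as an expectation, not as a verified result.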
