Python: Jaccard Distance using word intersection but not character intersection

Question

I didn't realize that Python's set function splits a string into individual characters. I wrote a Python function for Jaccard similarity that uses the set intersection method. The function takes two sets, and before calling it I apply set to each string.

Example: assume I have the string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg. I would call set("NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg"), which splits the string into characters. So when I pass the result to my jaccard function, the intersection is computed over characters instead of words. How can I compute a word-to-word intersection?

# Implementing Jaccard similarity; a and b must both be sets
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

If I don't call set on my string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg, I get the following error:

    c = a.intersection(b)
AttributeError: 'str' object has no attribute 'intersection'

Instead of a character-to-character intersection, I want a word-to-word intersection, and then the Jaccard similarity computed from it.

Recommended answer

Try splitting your string into words first:

word_set = set(your_string.split())

Example:

>>> word_set = set("NEW Fujifilm 16MP 5x".split())
>>> character_set = set("NEW Fujifilm 16MP 5x")
>>> word_set
set(['NEW', '16MP', '5x', 'Fujifilm'])
>>> character_set
set([' ', 'f', 'E', 'F', 'i', 'M', 'j', 'm', 'l', 'N', '1', 'P', 'u', 'x', 'W', '6', '5'])
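
Putting the two pieces together, a word-level Jaccard computation might look like the sketch below. The helper name word_jaccard and the second sample string are made up for illustration, and returning 1.0 when both strings are empty is an assumption rather than something the question specifies. The sketch uses Python 3 division, so no float() cast is needed.

```python
def word_jaccard(text_a, text_b):
    """Jaccard similarity between the whitespace-separated word sets of two strings."""
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    union = words_a | words_b
    if not union:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return len(words_a & words_b) / len(union)


a = "NEW Fujifilm 16MP 5x Optical Zoom"
b = "Fujifilm 16MP Digital CAMERA"  # hypothetical second product title

similarity = word_jaccard(a, b)
distance = 1.0 - similarity  # Jaccard distance is one minus the similarity
print(similarity, distance)
```

Note that split() with no arguments is case-sensitive and leaves punctuation attached, so "CAMERA" and "camera" count as different words unless the strings are normalized first (for example with lower()).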
