Using Python NLTK to find similarity between two web pages?

Question

I want to find out whether two web pages are similar. Can someone suggest whether Python NLTK with its WordNet similarity functions would be helpful, and how? What is the best similarity function to use in this case?

Recommended answer

The SpotSigs paper mentioned by joyceschan addresses duplicate-content detection, and it contains plenty of food for thought.

If you are looking for a quick comparison of key terms, the standard NLTK functions might suffice.
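As a rough baseline for such a key-term comparison, here is a minimal sketch that uses only the standard library. It assumes the visible text has already been extracted from each page. The `STOP_WORDS` set, the helper names, and the sample sentences are made up for illustration. It scores two texts by the Jaccard overlap of their term sets:

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def key_terms(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def jaccard(text_a, text_b):
    """Fraction of distinct key terms shared by the two texts (0.0 to 1.0)."""
    a, b = key_terms(text_a), key_terms(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

page_a = "Donations to the charity are welcome"
page_b = "The charity is grateful for every donation"
print(jaccard(page_a, page_b))
```

Note that exact matching treats "donations" and "donation" as unrelated terms, which is the kind of gap a WordNet lookup can help close.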

With NLTK you can pull synonyms of your terms by looking up the synsets contained in WordNet:

>>> from nltk.corpus import wordnet

>>> wordnet.synsets('donation')
[Synset('contribution.n.02'), Synset('contribution.n.03')]

>>> wordnet.synsets('donations')
[Synset('contribution.n.02'), Synset('contribution.n.03')]

It understands plurals, and it also tells you which part of speech the synonym corresponds to (the `n` in the synset name stands for noun).

Synsets are organized in a tree-like hierarchy, with more specific terms toward the leaves and more general ones toward the root. The more general terms are called hypernyms.

You can measure similarity by how close the terms are to their common hypernym.

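Watch out for different parts of speech: according to the NLTK cookbook, their hypernym paths do not overlap, so you should not try to measure similarity between them.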

Say you have two terms, donation and gift. You could get them from `synsets`, but in this example I initialize them directly:

>>> d = wordnet.synset('donation.n.01')
>>> g = wordnet.synset('gift.n.01')

The cookbook recommends the Wu-Palmer similarity method:

>>> d.wup_similarity(g)
0.93333333333333335
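To make the idea behind that score concrete, here is a minimal sketch of the Wu-Palmer formula, 2 * depth(common hypernym) / (depth(a) + depth(b)), on a hand-made taxonomy. The `PARENT` table and its terms are hypothetical and far simpler than WordNet, so the numbers will not match the NLTK output above. NLTK's own implementation also handles cases this sketch ignores, such as terms with several hypernym paths.

```python
# Hypothetical toy taxonomy: each term maps to its parent (its hypernym).
PARENT = {
    "entity": None,
    "act": "entity",
    "giving": "act",
    "gift": "giving",
    "donation": "gift",
    "loan": "giving",
}

def path_to_root(term):
    """Return the chain [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))

def lowest_common_hypernym(a, b):
    """First ancestor of a (walking upward) that is also an ancestor of b."""
    ancestors_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_b:
            return node
    return None

def wu_palmer(a, b):
    """2 * depth(common hypernym) / (depth(a) + depth(b))."""
    lch = lowest_common_hypernym(a, b)
    return 2 * depth(lch) / (depth(a) + depth(b))

print(wu_palmer("donation", "gift"))
print(wu_palmer("donation", "loan"))
```

The deeper the shared hypernym sits relative to the two terms, the closer the score gets to 1.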

This approach gives you a quick way to determine whether the terms used correspond to related concepts. Take a look at Natural Language Processing with Python to see what else you can do to help your text analysis.
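One possible way to turn term-level scores into a page-level score is sketched below. It is not part of the original answer. `page_similarity` and `exact_match` are hypothetical helpers, and the term lists are assumed to have been extracted from each page beforehand. Each term from the first page is matched with its best-scoring counterpart on the second page, and the results are averaged:

```python
def page_similarity(terms_a, terms_b, term_sim):
    """Average, over terms_a, of the best match each term finds in terms_b."""
    if not terms_a or not terms_b:
        return 0.0
    total = 0.0
    for a in terms_a:
        # A similarity function may return None for unrelated terms; count that as 0.
        total += max((term_sim(a, b) or 0.0) for b in terms_b)
    return total / len(terms_a)

def exact_match(a, b):
    """Stand-in similarity function: 1.0 for identical terms, else 0.0."""
    return 1.0 if a == b else 0.0

print(page_similarity(["donation", "charity"], ["gift", "charity"], exact_match))
```

The score is asymmetric, so you may want to compute it in both directions and average. In place of `exact_match`, you could pass a function that looks up one synset per term and calls `wup_similarity` on the pair, keeping to a single part of speech as cautioned above.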
