How do I compare phrases for similarity?

Question

When you enter a question, Stack Overflow presents you with a list of questions that it thinks are likely to cover the same topic. I have seen similar features on other sites and in other programs, too (help file systems, for example), but I've never programmed something like this myself. Now I'm curious to know what sort of algorithm one would use for that.

The first approach that comes to mind is splitting the phrase into words and looking for phrases containing these words. Before you do that, you probably want to throw away insignificant words (like 'the', 'a', 'does', etc.), and then you will want to rank the results.

Hey, wait - let's do that for web pages, and then we can have a ... watchamacallit ... - a "search engine", and then we can sell ads, and then ...

No, seriously, what are the common ways to solve this problem?

Recommended answer

One approach is the so-called bag-of-words model.

As you guessed, first you count how many times each word appears in the text (usually called a "document" in NLP lingo). Then you throw out the so-called stop words, such as "the", "a", "or" and so on.
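
As a rough illustration of this counting step, here is a minimal Python sketch. The stop-word list and the function name bag_of_words are made up for the example; real stop-word lists are much longer and the tokenization here is deliberately crude.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "or", "and", "of", "to", "is",
              "do", "does", "how", "i", "for"}

def bag_of_words(text):
    """Lowercase the text, split it into words, drop stop words, count the rest."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

bag = bag_of_words("How do I compare the phrases for similarity?")
print(bag)
```

What remains in the bag are the words that carry the topic of the phrase, each with its count.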

You're left with words and word counts. Do this across all your documents and you get a comprehensive set of the words that appear in them. You can then create an index for these words: "aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
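
One way to build such an index, sketched in Python. The helper name build_vocabulary and the three toy bags are invented for illustration, and positions start at 0 rather than 1 because that is more convenient for list indexing.

```python
def build_vocabulary(bags):
    """Map every distinct word across all bags to a position, in alphabetical order."""
    words = sorted(set().union(*bags))
    return {word: i for i, word in enumerate(words)}

# Toy word bags (word -> count), standing in for three documents.
bags = [
    {"aardvark": 2},
    {"apple": 1, "aardvark": 1},
    {"z-index": 3, "apple": 1},
]
vocab = build_vocabulary(bags)
print(vocab)
```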

Now you can take your word bags and turn them into vectors. For example, if your document contains two occurrences of "aardvark" and nothing else, its vector would look like this:

[2 0 0 ... 70k zeroes ... 0].
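
The conversion from a word bag to a vector might look like the following sketch. The name to_vector and the three-word vocabulary are assumptions for the example; with a real vocabulary of tens of thousands of words you would likely use a sparse representation instead of a plain list.

```python
def to_vector(bag, vocab):
    """Place each word's count at that word's position in the vocabulary."""
    vec = [0] * len(vocab)
    for word, count in bag.items():
        if word in vocab:  # words missing from the vocabulary are ignored
            vec[vocab[word]] = count
    return vec

# Toy vocabulary with 0-based positions (hypothetical, three words only).
vocab = {"aardvark": 0, "apple": 1, "z-index": 2}
vec = to_vector({"aardvark": 2}, vocab)
print(vec)
```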

After this you can compute the "angle" between two vectors using the dot product: the cosine of the angle is the dot product divided by the product of the two vectors' lengths. The smaller the angle, the closer the documents are.
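
In practice the cosine of the angle is usually used directly as the score (cosine similarity), which avoids computing the angle itself. A minimal sketch in plain Python, with made-up example vectors; returning 0.0 for an all-zero vector is a convention chosen here, not something the answer specifies:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_u * norm_v)

doc_a = [2, 0, 0]  # only "aardvark"
doc_b = [1, 1, 0]  # "aardvark" and "apple"
doc_c = [0, 0, 3]  # only "z-index"

print(cosine_similarity(doc_a, doc_b))  # shares a word with doc_a
print(cosine_similarity(doc_a, doc_c))  # shares no words with doc_a
```

For count vectors the score lies between 0 (no words in common) and 1 (same direction), so sorting candidate documents by descending score gives a ranking of the most similar ones.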

This is a simple version, and there are other, more advanced techniques. May the Wikipedia be with you.
