How do I approximate "Did you mean?" without using Google?

Question

I am aware of the duplicates of this question:

  • How does the Google "Did you mean?" Algorithm work?
  • How do you implement a "Did you mean"?
  • ... and many others.

Those questions ask how the algorithm actually works. My question is different: let's assume Google did not exist, or this feature did not exist, and we don't have user input. How would one go about implementing an approximate version of this algorithm?

Why is this interesting?

Ok. Try typing "qualfy" into Google and it tells you:

Did you mean: qualify

Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:

Did you mean: Try to reconnect you

Now this is the more interesting part. How does Google determine this? Does it keep a dictionary handy and guess the most probable words, again using user input? And how does it differentiate between a misspelled word and a sentence?

Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?

Recommended answer

Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
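
As a rough illustration of combining similarity with hit counts, here is a minimal Python sketch. It makes several assumptions that are not in the answer: similarity is measured with Levenshtein edit distance, popularity with the log of a hit count, and the two are combined with a single `distance_weight`. The `WORD_COUNTS` table and all function names are hypothetical.

```python
import math

# Hypothetical word -> hit-count table; a real system would build this
# from its own data or from a published word-frequency list.
WORD_COUNTS = {"qualify": 5000, "quality": 20000, "qualm": 40}


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def suggest(query, counts=WORD_COUNTS, max_distance=2, distance_weight=3.0):
    """Return the best-scoring known word, or None if no suggestion applies."""
    query = query.lower()
    if query in counts:
        return None  # already a known word, nothing to suggest
    best, best_score = None, float("-inf")
    for word, count in counts.items():
        d = edit_distance(query, word)
        if d > max_distance:
            continue
        # Popular words score higher; each edit costs distance_weight.
        score = math.log(count) - distance_weight * d
        if score > best_score:
            best, best_score = word, score
    return best


print(suggest("qualfy"))                       # qualify
print(suggest("qualfy", distance_weight=1.0))  # quality
```

With the default weight the closer word wins; lowering `distance_weight` lets the more popular but more distant "quality" win instead, which is the trade-off the answer says needs trial and error. Scanning every word is only workable for small dictionaries; a larger one would need an index of candidate words.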

The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
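
The prefix-and-recurse idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `WORDS` set is hypothetical, matching is case-insensitive, longer prefixes are tried first, and the first complete split found is returned. It does not attempt the typo-tolerant or tie-breaking variants mentioned above.

```python
from functools import lru_cache

# Hypothetical dictionary; a real one would be much larger. A set gives
# the fast "indexed" lookups the approach relies on.
WORDS = frozenset({"try", "to", "tor", "re", "connect", "reconnect", "you"})


def segment(text, words=WORDS):
    """Split text into dictionary words with no characters left over.

    Returns a list of words, or None if no complete split exists.
    """
    text = text.lower()

    @lru_cache(maxsize=None)  # memoize so each suffix is solved only once
    def solve(start):
        if start == len(text):
            return ()  # consumed the whole string: success
        # Try every valid word that begins the remaining string,
        # longest first, and recurse on what is left.
        for end in range(len(text), start, -1):
            candidate = text[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None  # no valid word starts here

    result = solve(0)
    return list(result) if result is not None else None


print(segment("Trytoreconnectyou"))  # ['try', 'to', 'reconnect', 'you']
```

Note the backtracking in this example: after "try" the longest match is "tor", but "econnectyou" cannot be split, so the search falls back to "to" and then succeeds.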
