Testing for similar string content

Question

I'm writing a bot that will analyse posts and reply with vaguely related strings from a database. I'm not aiming for coherence, just for a vague similarity that could pass as coming from someone ignorant of the topic (but knowledgeable enough to try to reply). What are some methods that would help me choose the right reply?

One thing I've come up with is to create a vocabulary list, check which elements of the list appear in the post, and fetch a reply from the database based on those matches. This crude method has been successful about 10% of the time (based on 100 replies to random posts). I could expand the list with more words, but this method has its limits. Are there any better ones?

(P.S. The database is sizeable: about 500,000 replies.)

Recommended answer

First of all, I think the best you can hope for will be about a 50% answer rate, unless you're prepared to write a lot of code.

If you're willing to get your hands dirty with some statistics, check out term frequency–inverse document frequency (tf-idf). Basically, you weight each word by how often it appears in a document and how rare it is across the whole collection, so uncommon words end up as the keywords that characterise a post. You then use those weighted keywords to pull out other replies that share them.
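As a rough illustration of the tf-idf idea, here is a minimal sketch in plain Python. The function names (`build_idf`, `tfidf_vector`, `best_reply`), the smoothed idf formula, and the use of cosine similarity for ranking are all assumptions made for this example; they are not something the question or the answer prescribes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map each known word in the text to (relative term frequency * idf)."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return {w: (c / total) * idf[w] for w, c in counts.items() if w in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_reply(post, replies):
    """Return (score, reply) for the stored reply most similar to the post."""
    idf = build_idf(replies)
    post_vec = tfidf_vector(post, idf)
    scored = [(cosine(post_vec, tfidf_vector(r, idf)), r) for r in replies]
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    sample_replies = [  # made-up data standing in for the reply database
        "The new graphics card overheats under load",
        "My cat sleeps all day on the sofa",
        "I bake sourdough bread every weekend",
    ]
    print(best_reply("Why does my graphics card get so hot?", sample_replies))
```

This version recomputes every reply vector on each call, which is fine for a toy list but not for 500,000 replies. At that size you would probably precompute the vectors once and look candidates up through an inverted index of keywords.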

You can combine this with whitelisting and blacklisting techniques to ignore common words and prioritize certain keywords, and then keep tuning those lists as you watch the algorithm work.
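One way to picture the blacklist and whitelist idea is the sketch below. The word lists, the boost values, and the function names are hypothetical placeholders chosen for this example; in practice you would tune them against your own posts.

```python
import re
from collections import Counter

# Hypothetical lists: words to ignore, and keywords to weight more heavily.
BLACKLIST = {"the", "a", "an", "is", "my", "to", "of", "and", "it", "so", "does", "why"}
WHITELIST_BOOST = {"graphics": 3.0, "card": 2.0}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def weighted_keywords(text, blacklist=BLACKLIST, boosts=WHITELIST_BOOST):
    """Count non-blacklisted words and multiply whitelisted ones by their boost."""
    counts = Counter(t for t in tokenize(text) if t not in blacklist)
    return {word: count * boosts.get(word, 1.0) for word, count in counts.items()}


def overlap_score(post, reply, blacklist=BLACKLIST, boosts=WHITELIST_BOOST):
    """Sum the post's keyword weights for every keyword that also appears in the reply."""
    post_keywords = weighted_keywords(post, blacklist, boosts)
    reply_words = set(tokenize(reply)) - blacklist
    return sum(weight for word, weight in post_keywords.items() if word in reply_words)


if __name__ == "__main__":
    post = "Why does my graphics card get so hot?"
    for candidate in ["The graphics card is loud", "My cat is hot"]:
        print(candidate, overlap_score(post, candidate))
```

Keeping both lists as plain data makes the tuning loop cheap: you can inspect a batch of bad replies, add the offending filler words to the blacklist, and rerun without touching the scoring code.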

There are also simpler string metrics you can use to test basic similarity. Take a look at this list of string metrics.
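To give a feel for what such metrics look like, here is a small sketch of two common ones: Levenshtein edit distance on characters and Jaccard similarity on word sets. Picking these two is an assumption for illustration; the answer does not single out any particular metric.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def jaccard(text_a, text_b):
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(jaccard("the cat sat", "the cat ran"))
```

Edit distance works best on short strings such as single words or titles, while a word-set measure like Jaccard is usually a better fit for comparing whole posts, where word order matters less than shared vocabulary.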
