Semantic search with NLP and Elasticsearch


Question

I am experimenting with Elasticsearch as a search server, and my task is to build "semantic" search functionality. From a short text phrase like "I have a burst pipe", the system should infer that the user is searching for a plumber and return all plumbers indexed in Elasticsearch.

Can that be done directly in a search server like Elasticsearch, or do I have to use a natural language processing (NLP) tool such as Maui Indexer? What is the exact term for the task at hand: text classification? The given text is very short, though, since it is a search phrase.

Recommended answer

There are several possible approaches, with differing implementation complexity.

The easiest one is to create a list of topics (such as plumbing), attach a bag of words to each topic (such as "pipe"), assign each search request to the topic that matches the majority of its keywords, and search only within that topic (you can add a field topic to your Elasticsearch documents and mark it as mandatory with + in the search query).
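As a rough illustration of this keyword approach, the sketch below picks a topic by counting keyword hits and then builds an Elasticsearch request body restricted to that topic. The topic names, the keyword bags, and the description field are made up for the example, and the topic field is assumed to be mapped as a keyword. The bool/filter clause is used here as a structured counterpart to making topic mandatory with +; it is one possible way to express the restriction, not the only one.

```python
import re
from collections import Counter

# Hypothetical topics and keyword bags; real ones would come from your own domain.
TOPIC_KEYWORDS = {
    "plumbing": {"pipe", "leak", "tap", "drain", "burst"},
    "electrical": {"fuse", "socket", "wiring", "outlet", "light"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def detect_topic(query):
    """Return the topic whose keyword bag matches the most query words, or None."""
    tokens = tokenize(query)
    hits = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits[topic] = sum(1 for token in tokens if token in keywords)
    topic, count = hits.most_common(1)[0]
    return topic if count > 0 else None


def build_search_body(query):
    """Build a request body that searches only inside the detected topic.

    Assumes documents have a keyword field "topic" and a text field
    "description" (the latter is a hypothetical name).
    """
    topic = detect_topic(query)
    text_match = {"match": {"description": query}}
    if topic is None:
        # No topic recognised: fall back to a plain full-text search.
        return {"query": {"bool": {"must": [text_match]}}}
    return {
        "query": {
            "bool": {
                # The filter makes the topic mandatory for every hit.
                "filter": [{"term": {"topic": topic}}],
                # The text match only influences ranking inside the topic.
                "should": [text_match],
            }
        }
    }


if __name__ == "__main__":
    print(detect_topic("I have a burst pipe"))
    print(build_search_body("I have a burst pipe"))
```

The resulting dictionary can be passed as the body of a search request. Because the text match sits in the should clause next to a filter, every document in the detected topic is returned, and documents that also match the phrase rank higher.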

Of course, if you have lots of documents, creating the topic list and bags of words manually is very time-consuming. You can use machine learning to automate some of these tasks. Basically, a distance measure between words and/or documents is enough to discover topics automatically (e.g. by data clustering) and to classify a query into one of these topics. A mix of these techniques may also be a good choice (for example, you can create topics manually and assign initial documents to them, but use classification for query assignment). Take a look at Wikipedia's article on latent semantic analysis to better understand the idea. Also pay attention to the two linked articles on data clustering and document classification. And yes, Maui Indexer may be a good helper tool for this.
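The mixed approach (manually chosen topics with a few seed documents, then classification of the query by distance) can be sketched with nothing more than word counts and cosine similarity. This is a minimal illustration, not latent semantic analysis itself: LSA would replace the raw word counts with vectors in a reduced-dimension space. The topics and seed documents are invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical seed documents, manually assigned to topics.
LABELED_DOCS = {
    "plumbing": [
        "fix burst pipe and leaking tap",
        "pipe repair and drain cleaning",
    ],
    "electrical": [
        "rewire house and replace fuse box",
        "install light switch and socket",
    ],
}


def vectorize(text):
    """Turn text into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_centroids(labeled_docs):
    """Sum the word counts of each topic's documents into one vector per topic.

    Cosine similarity ignores vector length, so the sum behaves like the mean.
    """
    centroids = {}
    for topic, docs in labeled_docs.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        centroids[topic] = total
    return centroids


def classify_query(query, centroids):
    """Return the closest topic and the similarity score for every topic."""
    query_vector = vectorize(query)
    scores = {topic: cosine(query_vector, c) for topic, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    centroids = topic_centroids(LABELED_DOCS)
    print(classify_query("I have a burst pipe", centroids))
```

With raw word counts, the query is only assigned correctly when it shares words with the seed documents. That limitation is the motivation for techniques such as latent semantic analysis, which can relate words that never co-occur in the query and the documents directly.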

Finally, you can try to build an engine that "understands" the meaning of the phrase (rather than just using term frequencies) and searches the appropriate topics. Most probably, this will involve natural language processing and ontology-based knowledge bases. But this field is still under active research, and without previous experience it will be very hard for you to implement something like this.
