Semantic search with NLP and elasticsearch

Question

I am experimenting with elasticsearch as a search server, and my task is to build a "semantic" search feature. From a short text phrase like "I have a burst pipe", the system should infer that the user is searching for a plumber and return all plumbers indexed in elasticsearch.

Can that be done directly in a search server like elasticsearch, or do I have to use a natural language processing (NLP) tool such as Maui Indexer? What is the exact term for my task: is it text classification? Note that the input text is very short, since it is a search phrase.

Recommended answer

There are several possible approaches, with different levels of implementation complexity.

The easiest one is to create a list of topics (like plumbing), attach a bag of words to each topic (like "pipe"), identify the topic of a search request by the majority of its keywords, and search only within that topic. To restrict the search, you can add a field topic to your elasticsearch documents and mark it as mandatory with + during search.
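The keyword approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic names, the keyword bags, and the helper names (`guess_topic`, `build_query_string`) are hypothetical, and the resulting string is meant for elasticsearch's `query_string` query, where a leading + marks a clause as required.

```python
import re
from collections import Counter

# Hypothetical topic -> bag-of-words mapping; a real list would be curated by hand.
TOPIC_KEYWORDS = {
    "plumbing": {"pipe", "burst", "leak", "drain", "faucet", "toilet"},
    "electrical": {"socket", "fuse", "wiring", "outlet", "breaker", "light"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def guess_topic(query):
    """Return the topic whose keyword bag matches the most query tokens, or None."""
    tokens = tokenize(query)
    hits = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits[topic] = sum(1 for token in tokens if token in keywords)
    topic, count = hits.most_common(1)[0]
    return topic if count > 0 else None


def build_query_string(query):
    """Prefix the query with a mandatory topic clause when a topic is recognized."""
    topic = guess_topic(query)
    if topic is None:
        return query  # no topic recognized: fall back to a plain search
    return f"+topic:{topic} {query}"


query_string = build_query_string("I have a burst pipe")
# Request body for a search; the field name "topic" is an assumption about the index.
search_body = {"query": {"query_string": {"query": query_string}}}
print(search_body)
```

User input passed to `query_string` may contain reserved characters, so a real application would need to escape it first; that step is left out here.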

Of course, if you have lots of documents, manually creating the topic list and the bags of words is very time-consuming. You can use machine learning to automate some of these tasks. Basically, a distance measure between words and/or documents is enough to automatically discover topics (e.g. by data clustering) and to classify a query into one of these topics. A mix of these techniques may also be a good choice (for example, you can manually create topics and assign initial documents to them, but use classification for query assignment). Take a look at Wikipedia's article on latent semantic analysis to better understand the idea. Also pay attention to the articles on data clustering and document classification. And yes, Maui Indexer may be a good helper tool for this.
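The mixed variant (manually assigned seed documents, automatic query assignment) could look roughly like the sketch below. It uses plain bag-of-words cosine similarity as the distance measure, which is a much simpler stand-in for latent semantic analysis, not an implementation of it. The seed documents, the stop-word list, and all function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical seed documents, manually assigned to topics.
SEED_DOCS = {
    "plumbing": [
        "fix a leaking pipe under the sink",
        "burst pipe flooding the bathroom",
        "unclog a blocked drain",
    ],
    "electrical": [
        "replace a blown fuse",
        "install a new wall socket",
        "light switch wiring repair",
    ],
}

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "i", "have", "my", "is"}


def vectorize(text):
    """Turn text into a term-frequency vector, ignoring stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_profiles(seed_docs):
    """Sum the vectors of each topic's seed documents into one profile per topic."""
    profiles = {}
    for topic, docs in seed_docs.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        profiles[topic] = total
    return profiles


def classify(query, profiles):
    """Assign the query to the most similar topic profile, or None if nothing matches."""
    query_vector = vectorize(query)
    scores = {topic: cosine(query_vector, profile) for topic, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


profiles = topic_profiles(SEED_DOCS)
print(classify("I have a burst pipe", profiles))
```

Because this compares literal words, a query only matches a topic when it shares vocabulary with the seed documents. Techniques such as latent semantic analysis aim to relax that limitation by comparing documents in a reduced concept space.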

Finally, you can try to build an engine that "understands" the meaning of the phrase (rather than just using term frequencies) and searches the appropriate topics. Most probably, this will involve natural language processing and ontology-based knowledge bases. In practice, though, this field is still under active research, and without previous experience it will be very hard to implement something like this.
