Categorize text with Elasticsearch


Problem description





I am currently working on a Project which involves identifying different "keywords" out of a text.

As an example, let's assume the following input text:

"This is an example of some text written from Buenos Aires about Meat".

Further, let's assume that my Elasticsearch instance has the following documents stored:

Cities: [Barcelona, Buenos Aires, Los Angeles, ...]

and

Categories: [finance, politics, ..]

I need a way to identify the corresponding city and category from the input text.

My first approach was to run a search query with the "or" operator and see which document ranks highest. After that, I would re-match the matched documents against the text to ensure that the phrases are really there (in other words, to ensure that "los angeles" matches because the phrase "los angeles" is in the text, and not only "los" or "angeles").
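
The client-side re-match described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the original question: the helper name `contains_phrase` and the simple `\w+` tokenization are assumptions, and a real Elasticsearch analyzer may split text differently.

```python
import re


def contains_phrase(text: str, phrase: str) -> bool:
    """Return True if all tokens of `phrase` occur consecutively and in order in `text`."""
    # Hypothetical tokenization: lowercase, then split on runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    wanted = re.findall(r"\w+", phrase.lower())
    if not wanted:
        return False
    n = len(wanted)
    # Slide a window of len(wanted) tokens over the text and compare.
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


text = "This is an example of some text written from Buenos Aires about Meat"
print(contains_phrase(text, "Buenos Aires"))
print(contains_phrase(text, "Los Angeles"))
print(contains_phrase("Los gatos de Angeles", "Los Angeles"))
```

The last call shows the case the question worries about: both "los" and "angeles" are present, but not as a phrase, so the check fails.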

I am wondering whether this is a best-practice way of doing this kind of thing with Elasticsearch.

Solution

I would suggest the following:

  • Use match_phrase queries to verify that the terms los and angeles both occur, adjacent to each other and in the same order.
  • Wrap each query in a named filter so that you can identify which ones matched.
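
With many candidate cities and categories, the request body is easier to generate than to write by hand. The following Python sketch builds a body of the same shape as the query used in this answer. The function name `build_named_phrase_filter` and the rule for deriving `_name` from the phrase are assumptions made for illustration. The `or` and `fquery` filters belong to the Elasticsearch version this answer was written for; newer releases may expect a different structure.

```python
import json


def build_named_phrase_filter(field, phrases):
    """Build a request body with one named match_phrase filter per candidate phrase."""
    clauses = []
    for phrase in phrases:
        # Hypothetical naming rule: "Buenos Aires" -> "buenos_aires".
        name = phrase.lower().replace(" ", "_")
        clauses.append({
            "fquery": {
                "_name": name,
                "query": {"match_phrase": {field: phrase}},
            }
        })
    # constant_score + or: a document matches if any named phrase filter matches.
    return {"query": {"constant_score": {"filter": {"or": clauses}}}}


body = build_named_phrase_filter("text", ["Buenos Aires", "Los Angeles"])
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d, or sent with any HTTP client.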

For instance, create this document:

curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1'  -d '
{
   "text" : "This is an example of some text written from Buenos Aires about Meat"
}
'

Then run this query looking for Buenos Aires or Los Angeles:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "or" : [
               {
                  "fquery" : {
                     "_name" : "buenos_aires",
                     "query" : {
                        "match_phrase" : {
                           "text" : "Buenos Aires"
                        }
                     }
                  }
               },
               {
                  "fquery" : {
                     "_name" : "los_angeles",
                     "query" : {
                        "match_phrase" : {
                           "text" : "Los Angeles"
                        }
                     }
                  }
               }
            ]
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "text" : "This is an example of some text written from Buenos Aires about Meat"
#             },
#             "_score" : 1,
#             "_index" : "test",
#             "_id" : "JIwnN_FVTv-0i5YGrlHLeg",
#             "_type" : "test",
#             "matched_filters" : [
#                "buenos_aires"
#             ]
#          }
#       ],
#       "max_score" : 1,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 58
# }

Note the matched_filters element in the results, indicating which filter matched.
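
To turn such a response into labels, it is enough to read that element from each hit. A minimal Python sketch follows, assuming the response has already been parsed into a dict. The helper name `matched_names` is hypothetical and the sample response is trimmed from the output above. The fallback to a `matched_queries` key is an assumption about newer Elasticsearch releases, which, as far as I know, report named matches under that key instead.

```python
def matched_names(response):
    """Map each hit's _id to the names of the filters (or queries) that matched it."""
    result = {}
    for hit in response.get("hits", {}).get("hits", []):
        # "matched_filters" is the key shown in this answer's output;
        # "matched_queries" is an assumed fallback for newer versions.
        names = hit.get("matched_filters") or hit.get("matched_queries") or []
        result[hit["_id"]] = list(names)
    return result


# Sample response trimmed from the output shown in this answer.
response = {
    "hits": {
        "hits": [
            {
                "_id": "JIwnN_FVTv-0i5YGrlHLeg",
                "_score": 1,
                "matched_filters": ["buenos_aires"],
            }
        ],
        "total": 1,
    }
}
print(matched_names(response))
```

A hit that matched no named filter is mapped to an empty list, so callers can tell "no city found" apart from a missing document.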
