SOLR exact match boost over text containing the exact match

Question

I could not find a better title; I hope to change it later, if possible, based on your suggestions.

My problem:

I got a database with music artists. These look like this: "dr. dre feat. akon", "eminem & dr. dre", "dr. dre feat. ll cool j", "dr. dre", "dr. dre feat. eminem & skylar grey". We only have two fields: id and name.

On a Solr core with the default schema I run this query: "q=dr. dre". The results are OK but not perfect, and look like this:

  • dr. dre feat. akon
  • eminem & dr. dre
  • dr. dre feat. ll cool j
  • dr. dre
  • ...

Note that they got the exact same score.

What I want is to have "dr. dre" as a first result, and then all the others, like this:

  • dr. dre <<-- dr. dre comes first
  • eminem & dr. dre
  • dr. dre feat. ll cool j
  • dr. dre feat. akon
  • ...

How do I achieve this? (Filters, tokenizers, copy fields, etc.: it does not matter which. I cannot change code inside Solr, as I've seen suggested on some other forum.)

Thanks.

Recommended answer

There are a couple of different ways to get the "dr. dre" result to come up first. I apologize for the lengthy answer, but as often occurs in Solr, the answer depends on your priorities and needs.

This is probably redundant, but I'd like to start by making sure that you are seeing the scores for each result, since your question didn't make this entirely clear. Solr sorts results by descending score by default, but stating the sort explicitly rules out a different default having been configured in solrconfig.xml. I imagine that you are already doing this, but just to make sure, you can try a query like this: q="dr. dre"&fl=*,score&sort=score desc. That will show you the calculated score for each result, and sort the results with the highest scores first.
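If you build the request in code, the same query can be assembled roughly as follows. This is only a sketch using Python's standard library; the host, port and core name (localhost:8983, artists) are placeholders I am assuming, not something from the question.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/artists/select"


def build_score_query(phrase):
    """Return a select URL that asks for all fields plus the score, best first."""
    params = {
        "q": f'"{phrase}"',     # quoted, so it is sent as a phrase query
        "fl": "*,score",        # every stored field, plus the computed score
        "sort": "score desc",   # highest score first
    }
    # urlencode takes care of escaping the quotes, spaces, '*' and ','.
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_score_query("dr. dre"))
```

Letting urlencode do the escaping avoids the common mistake of sending raw quotes and spaces in the URL.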

Norms

Norms are a flexible option that works with Solr fairly naturally. Your name field should probably have a type value that maps to a fieldType entry. The fieldType should probably have class="solr.TextField", and it should not have omitNorms="true". Unless you explicitly omit norms on your name field, Solr will consider how much of the name matches your search terms and how many times your search terms match in the name when calculating the score for a document. "dr. dre" would have the highest score because 100% of the words in the name match your search.

You can read about norms and see a good general text fieldType configuration on the Solr documentation wiki, or in your downloaded Solr documentation for your particular Solr version. The advantage of relying on norms is that in addition to being fairly easy to implement, they are progressive. So while "dr. dre" would be the most relevant record with 100% of its name matching your search, "eminem & dr. dre" would also be more relevant than "a whole list of guys & also dr. dre" because your search term is a larger proportion of the name.
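To make the "progressive" effect concrete, here is a small sketch of the length norm alone. It assumes the classic TF-IDF similarity, where the norm is roughly 1 / sqrt(number of terms in the field); newer Solr versions default to BM25, which normalizes length with a different formula but still favours shorter fields. The regex tokenizer is a crude stand-in for a real analyzer, and Lucene stores norms in a compressed form, so real values are coarser than these.

```python
import math
import re


def length_norm(name):
    """Approximate classic Lucene length norm: 1 / sqrt(number of terms)."""
    terms = re.findall(r"\w+", name.lower())  # crude stand-in for a real analyzer
    return 1.0 / math.sqrt(len(terms))


names = [
    "dr. dre feat. akon",
    "eminem & dr. dre",
    "dr. dre feat. ll cool j",
    "dr. dre",
]

# Every name contains both query terms, so the shorter the name,
# the larger its norm and the higher it ranks.
ranked = sorted(names, key=length_norm, reverse=True)
for name in ranked:
    print(f"{length_norm(name):.3f}  {name}")
```

With this approximation "dr. dre" (two terms) comes first and "dr. dre feat. ll cool j" (six terms) last, which is the ordering the question asks for.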

Exact match

Exact match is a complicated issue in Solr, largely because there are varying degrees of "exactitude", and a truly exact match is rarely desirable in real life. For example, if your record has the name "dr. dre", is "dr dre" (without the period) close enough to be exact? Is "Dr. Dre"? Is " dr. dre"?

If you decide to implement an exact match search, then you will probably want to set up a copyField in your schema.xml:

<copyField source="name" dest="exactName"/>

Then, you will want to search both fields together. How you do this depends on which query parser you're using. If you are using the standard/lucene query parser, then you will need to set up your queries with OR searching (e.g. q=name:"dr. dre" OR exactName:"dr. dre"^4). A "^4" after a search term makes that match 4 times as important/relevant as a match elsewhere in the query. If you are using the Dismax or Extended Dismax query parser, you have access to the qf parameter, which allows you to provide a list of fields to use for your search, and to weight some of them as more important than others. For example, qf=exactName^4 name&q="dr. dre" tells Solr to check for "dr. dre" in both fields, but to consider a match in the exactName field to be 4 times as relevant as one in the name field. (If this works for you, the default qf can be set in solrconfig.xml so it doesn't need to be restated with every query.)
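The two query styles can be sketched side by side like this. The helper names are my own, the boost of 4 is just the value used above, and the code assumes the phrase itself contains no double quotes (it does no escaping of the user's input).

```python
from urllib.parse import urlencode


def lucene_params(phrase, boost=4):
    """Standard parser: OR the two fields and boost the exact-match clause."""
    q = f'name:"{phrase}" OR exactName:"{phrase}"^{boost}'
    return {"q": q}


def edismax_params(phrase, boost=4):
    """Extended DisMax: list the fields once in qf, with a boost on exactName."""
    return {
        "defType": "edismax",            # select the Extended DisMax parser
        "q": f'"{phrase}"',              # quoted phrase, searched in every qf field
        "qf": f"exactName^{boost} name",
    }


print(urlencode(lucene_params("dr. dre")))
print(urlencode(edismax_params("dr. dre")))
```

The qf form keeps the user's text in q and the field weighting in a separate parameter, which is what makes it easy to move the weighting into solrconfig.xml later.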

This leaves the fieldType of the exactName field undecided. If you feel that only a completely precise match will work and variations in capitalization or punctuation make a match non-exact, then you could set up the exactName field as a string:

<field name="exactName" type="string" indexed="true" stored="false" multiValued="false"/>

But more likely, you will want to allow some variation in what counts as "exact", in which case you will need to make a new fieldType, probably using the Keyword Tokenizer, which will not break the exact name into multiple indexed tokens, but keep it as a single token. For example:

<fieldType name="exactish" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="exactName" type="exactish" indexed="true" stored="false" multiValued="false"/>

This very basic example only includes the Keyword Tokenizer to keep the whole name as a single token, and the Lower Case Filter to make sure that the difference between upper and lower case is not relevant. If you want your exact match to be forgiving of any other conditions, you would need to modify the analysis for the fieldType.
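To see what this analysis chain does and does not forgive, here is a tiny Python model of it. It only imitates the behaviour (one token, lowercased); it is not how Solr implements it, and the function names are made up for illustration.

```python
def exactish_tokens(value):
    """Mimic KeywordTokenizer + LowerCaseFilter: the whole value as one lowercased token."""
    return [value.lower()]


def exactish_match(indexed_value, query_phrase):
    """True when the query phrase analyzes to the same single token as the indexed value."""
    return exactish_tokens(indexed_value) == exactish_tokens(query_phrase)


print(exactish_match("dr. dre", "Dr. Dre"))   # case difference is ignored
print(exactish_match("dr. dre", "dr dre"))    # a missing period still counts as different
print(exactish_match("dr. dre", " dr. dre"))  # a leading space also breaks the match
```

If the last two cases should match as well, the usual route is to add more filters to the analyzer (for example solr.TrimFilterFactory for stray whitespace, or solr.PatternReplaceFilterFactory for punctuation); check the documentation for your Solr version before relying on either.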

Important: when searching against a string field, or a text field that has the Keyword Tokenizer, it's a good idea to make sure that the searches you send to Solr always have quotes around them (i.e. phrase search). Otherwise, your search will be broken up into individual terms before ever being compared to the field, and none of your terms is likely to match the entire indexed field. This can lead to never finding any matches in the field at all, except when the values don't contain spaces anyway. This is not an issue if you just use norms to control relevance in a TextField with more standard tokenization.
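The quoting problem can be illustrated with a rough model of what happens to the query text. This is a simplification I am assuming for the sake of the example (a quoted phrase stays whole, anything else is split on whitespace); the real query parser is more involved.

```python
def query_terms(raw_query):
    """Rough model of the parser: a quoted phrase stays whole, otherwise split on spaces."""
    raw_query = raw_query.strip()
    if len(raw_query) >= 2 and raw_query[0] == '"' and raw_query[-1] == '"':
        return [raw_query[1:-1].lower()]
    return [term.lower() for term in raw_query.split()]


def matches_keyword_field(indexed_value, raw_query):
    """The field holds a single token, so some query term must equal the entire value."""
    indexed_token = indexed_value.lower()
    return any(term == indexed_token for term in query_terms(raw_query))


print(matches_keyword_field("dr. dre", '"dr. dre"'))  # quoted phrase: matches
print(matches_keyword_field("dr. dre", "dr. dre"))    # unquoted: "dr." and "dre" each fail
print(matches_keyword_field("eminem", "eminem"))      # no spaces, so quoting does not matter
```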
