Lucene wildcard matching fails on chemical notations(?)

Problem description

Using Hibernate Search annotations (mostly just @Field(index = Index.TOKENIZED)), I've indexed a number of fields related to a persisted class of mine called Compound. I've set up text search over all the indexed fields using the MultiFieldQueryParser, which has worked fine so far.

Among the fields indexed and searchable is a field called compoundName, with sample values:

  • 3-Hydroxyflavone
  • 6,4'-Dihydroxyflavone

When I search for either of these values in full, the related Compound instances are returned. However, problems occur when I use a partial name and introduce wildcards:

  • searching for 3-Hydroxyflav* still gives the correct hit, but
  • searching for 6,4'-Dihydroxyflav* fails to find anything.

As I'm quite new to Lucene / Hibernate Search, I'm not sure where to look at this point. I think it might have something to do with the ' present in the second query, but I don't know how to proceed. Should I look into Tokenizers / Analyzers / QueryParsers, or something else entirely?

Or can anyone tell me how I can get the second wildcard search to match, preferably without breaking the MultiField-search behavior?

I'm using Hibernate-Search 3.1.0.GA & Lucene-core 2.9.3.

Some relevant code bits to illustrate my current approach:

Relevant parts of the indexed Compound class:

@Entity
@Indexed
@Data
@EqualsAndHashCode(callSuper = false, of = { "inchikey" })
public class Compound extends DomainObject {
    @NaturalId
    @NotEmpty
    @Length(max = 30)
    @Field(index = Index.TOKENIZED)
    private String                  inchikey;

    @ManyToOne
    @IndexedEmbedded
    private ChemicalClass           chemicalClass;

    @Field(index = Index.TOKENIZED)
    private String                  commonName;
...
}

How I currently search over the indexed fields:

String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser = 
    new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new StandardAnalyzer(Version.LUCENE_29));
FullTextSession fullTextSession = Search.getFullTextSession(getSession());
FullTextQuery fullTextQuery = 
    fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();

Recommended answer

I think your problem is a combination of analyzer and query language problems. It is hard to say what exactly causes it. To find out, I recommend you inspect your index using the Lucene index tool Luke.

Since you are not using a custom analyzer in your Hibernate Search configuration, the default (StandardAnalyzer) is used. This is consistent with the fact that you pass StandardAnalyzer to the constructor of MultiFieldQueryParser (always use the same analyzer for indexing and searching!). What I am not so sure of is how "6,4'-Dihydroxyflavone" gets tokenized by StandardAnalyzer. That is the first thing you have to find out. For example, the javadoc says:

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

It might be that you need to write your own analyzer that tokenizes your chemical names the way your use cases require.
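To see why the tokenization matters for a wildcard query, here is a minimal plain-Java sketch (no Lucene dependency). As far as I know, the classic query parser does not run wildcard and prefix terms through the analyzer (by default it only lowercases them), so the prefix is compared against each indexed term as a whole. The term lists below are assumptions for illustration only, not verified StandardAnalyzer output; check the real terms in Luke. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PrefixMatchSketch {
    // Returns true if any single indexed term starts with the lowercased prefix.
    // This mimics how a prefix query is matched against individual index terms.
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : indexedTerms) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the analyzer kept this name as one token.
        List<String> wholeToken = Arrays.asList("3-hydroxyflavone");
        // ASSUMPTION: the analyzer split this name into several tokens.
        List<String> splitTokens = Arrays.asList("6,4", "dihydroxyflavone");

        System.out.println(prefixMatches(wholeToken, "3-Hydroxyflav"));
        System.out.println(prefixMatches(splitTokens, "6,4'-Dihydroxyflav"));
        System.out.println(prefixMatches(splitTokens, "Dihydroxyflav"));
    }
}
```

If Luke shows that your index really contains split tokens like the assumed ones, no single term starts with the full prefix, which would explain the empty result. In that case either a custom analyzer or a query built from the individual pieces would be the way to go.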

Next, the query parser. Make sure you understand the Lucene query syntax. Some characters have a special meaning, for example '-', which acts as the prohibit (NOT) operator when it starts a term. It could be that your query is parsed the wrong way.
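One way to rule out the query syntax as the culprit is to escape the special characters in the user input before appending the wildcard. Lucene ships QueryParser.escape(String) for this, but it also escapes '*' and '?', so the wildcard has to be appended afterwards. The sketch below is a hand-rolled, plain-Java stand-in so it runs without Lucene on the classpath. The class name, method names and the exact character list are my assumptions; prefer the library method in real code.

```java
class QueryEscapeSketch {
    // ASSUMPTION: characters treated as special by the classic query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Prefixes every special character with a backslash.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Escapes the partial name first, then appends the trailing wildcard
    // so that the '*' itself stays unescaped.
    static String prefixQuery(String partialName) {
        return escape(partialName) + "*";
    }

    public static void main(String[] args) {
        System.out.println(prefixQuery("6,4'-Dihydroxyflav"));
    }
}
```

Note that the apostrophe is not in the list, so only the '-' gets a backslash here. Escaping only fixes how the query is parsed. If the indexed terms do not start with the resulting prefix, the search will still come back empty.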

Either way, the first step is to find out how your chemical names get tokenized. Hope that helps.
