Lucene.Net: Relevancy by distance between words


Question

I create (and frequently update) an index of users with the following code (slightly shortened here for demonstration purposes):

            Lucene.Net.Store.Directory directory = FSDirectory.Open(new System.IO.DirectoryInfo("TestLuceneIndex"));
            StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
            IndexWriter indexWriter = new IndexWriter(directory, standardAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.Add(new Field("UID", uid, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("GENDER", gender, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("COUNTRY", countrycode, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("CITY", citycode, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("USERDATA", userdata, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
            doc.Add(new Field("USERINFO", userinfo, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
            indexWriter.UpdateDocument(new Term("UID", uid), doc);
            indexWriter.Optimize();
            indexWriter.Commit();
            indexWriter.Close();

The values stored in the index are as follows:

UID - user id (string GUID)
GENDER - gender id (string: "0" (unidentified), "1" (male) or "2" (female))
COUNTRY - country code (string like "US", "FR", etc.)
CITY - city code (string like "A121", "C432", etc.)
USERDATA - long string of user details (something like "John Doe j.doe@gmail.com designer high education 5 years of experience")
USERINFO - long string of text about the user (something like "My name is John Doe. I was born ...")

Then I search the index. I search in two fields (USERDATA and USERINFO) and, whenever necessary, filter the results by GENDER, COUNTRY and CITY. As the result I retrieve the UID (I need this value to identify the user's record in the DB).

This is the code I use for searching:

        Lucene.Net.Store.Directory directory = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo("TestLuceneIndex"));
        standardAnalyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        Lucene.Net.Index.IndexReader indexReader = Lucene.Net.Index.IndexReader.Open(directory, true);
        indexSearcher = new Lucene.Net.Search.IndexSearcher(indexReader);
        Lucene.Net.Search.BooleanQuery booleanQuery = new Lucene.Net.Search.BooleanQuery();
        Lucene.Net.QueryParsers.MultiFieldQueryParser queryTextParser = new Lucene.Net.QueryParsers.MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, new string[] { "USERDATA", "USERINFO" }, standardAnalyzer);
        Lucene.Net.Search.Query queryText = queryTextParser.Parse(SearchText);
        booleanQuery.Add(queryText, Lucene.Net.Search.BooleanClause.Occur.MUST);
        if (searchGender != "0")
        {
            Lucene.Net.Index.Term termGender = new Lucene.Net.Index.Term("GENDER", searchGender);
            Lucene.Net.Search.Query queryGender = new Lucene.Net.Search.TermQuery(termGender);
            booleanQuery.Add(queryGender, Lucene.Net.Search.BooleanClause.Occur.MUST);
        }
        if (searchCity != "0")
        {
            Lucene.Net.Index.Term termCity = new Lucene.Net.Index.Term("CITY", searchCity);
            Lucene.Net.Search.Query queryCity = new Lucene.Net.Search.TermQuery(termCity);
            booleanQuery.Add(queryCity, Lucene.Net.Search.BooleanClause.Occur.MUST);
        }
        if (searchCountry != "0")
        {
            Lucene.Net.Index.Term termCountry = new Lucene.Net.Index.Term("COUNTRY", searchCountry);
            Lucene.Net.Search.Query queryCountry = new Lucene.Net.Search.TermQuery(termCountry);
            booleanQuery.Add(queryCountry, Lucene.Net.Search.BooleanClause.Occur.MUST);
        }
        Lucene.Net.Search.TopScoreDocCollector collector = Lucene.Net.Search.TopScoreDocCollector.create(indexReader.MaxDoc(), true);
        indexSearcher.Search(booleanQuery, collector);
        Lucene.Net.Search.ScoreDoc[] scoreDocs=collector.TopDocs().scoreDocs;
        Lucene.Net.Highlight.Formatter formatter = new Lucene.Net.Highlight.SimpleHTMLFormatter("<b>", "</b>");
        Lucene.Net.Highlight.QueryScorer queryScorer = new Lucene.Net.Highlight.QueryScorer(booleanQuery);
        highlighter = new Lucene.Net.Highlight.Highlighter(formatter, queryScorer);
        Lucene.Net.Highlight.Fragmenter fragmenter = new Lucene.Net.Highlight.SimpleFragmenter(150);
        highlighter.SetTextFragmenter(fragmenter);

Everything works well enough except for the quality of relevance when the query contains several words. When I search for, say, (microsoft .net programmer), results containing the exact substring are not scored higher than results containing those words in different places in the text. I understand this is because the score is based on how much of the search string occurs in the text rather than on how exactly the strings coincide. But how can I make the scoring algorithm value exactness more? That is, how can I make the distance between the matched words count for more in the relevancy calculation?

Recommended answer

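The most effective (and also the most laborious) way would be to write your own query object that scores documents higher the closer together the words appear. SpanQuery is a good place to start.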

The easiest way would be to use a proximity search along with the regular boolean query: ("search text"~10 || (search && text)). This will bring the proximity phrase matches higher.
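As a rough sketch of that suggestion, the query text could be assembled before it is handed to the parser from the question. `ProximityQueryText` and `Build` are hypothetical names introduced here for illustration, and the sketch assumes the input consists of plain words; text containing query-syntax characters such as quotes or parentheses would need escaping first.

```csharp
using System;

static class ProximityQueryText
{
    // Hypothetical helper: ORs a sloppy phrase with a clause that requires
    // every word, following the ("search text"~10 || (search && text)) pattern.
    public static string Build(string searchText, int slop)
    {
        // Passing null as the separator list splits on any whitespace.
        string[] words = searchText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // A single word has no neighbours, so proximity adds nothing.
        if (words.Length < 2)
        {
            return searchText.Trim();
        }

        string phrase = "\"" + string.Join(" ", words) + "\"~" + slop;
        string allWords = "(" + string.Join(" AND ", words) + ")";
        return "(" + phrase + " OR " + allWords + ")";
    }
}
```

With the code from the question, the call would look something like `queryTextParser.Parse(ProximityQueryText.Build(SearchText, 10))`. Documents that satisfy the phrase clause also satisfy the all-words clause, so they collect score from both and rank above documents where the words are scattered.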

4.3. Proximity Searches - Lucene supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde symbol, "~", at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search: "jakarta apache"~10

Since you are building your own query, you could even boost "search text"~10 more than "search text"~20, which in turn is boosted higher than (search && text).
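One way to sketch those tiers programmatically, assuming the Lucene.Net 2.9 API used in the question (`SetSlop` and `SetBoost` as methods; newer releases may expose them differently). `TieredProximityQuery`, the slop values 10 and 20, and the boosts 4 and 2 are illustrative choices, not values from the answer. The terms passed in are assumed to be already analyzed the same way as the indexed field (for example, lowercased by StandardAnalyzer).

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

static class TieredProximityQuery
{
    // Hypothetical helper: tight phrase match > looser phrase match > all terms present.
    public static Query Build(string field, string[] analyzedTerms)
    {
        BooleanQuery tiers = new BooleanQuery();
        tiers.Add(SloppyPhrase(field, analyzedTerms, 10, 4.0f), BooleanClause.Occur.SHOULD);
        tiers.Add(SloppyPhrase(field, analyzedTerms, 20, 2.0f), BooleanClause.Occur.SHOULD);

        // Unboosted fallback: every term must occur somewhere in the field.
        BooleanQuery allTerms = new BooleanQuery();
        foreach (string term in analyzedTerms)
        {
            allTerms.Add(new TermQuery(new Term(field, term)), BooleanClause.Occur.MUST);
        }
        tiers.Add(allTerms, BooleanClause.Occur.SHOULD);

        return tiers;
    }

    private static PhraseQuery SloppyPhrase(string field, string[] terms, int slop, float boost)
    {
        PhraseQuery phrase = new PhraseQuery();
        foreach (string term in terms)
        {
            phrase.Add(new Term(field, term));
        }
        phrase.SetSlop(slop);   // maximum positional distance allowed between the terms
        phrase.SetBoost(boost); // weight of this tier relative to the others
        return phrase;
    }
}
```

The resulting query could be added to the question's `booleanQuery` with `Occur.MUST` in place of the parsed `queryText`, once per searched field or wrapped in another `BooleanQuery` covering both USERDATA and USERINFO.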
