Language Model through Whoosh in Information Retrieval


Question

I am working in IR.

Can anyone guide me on how to implement a language model in Whoosh? I have already applied TF-IDF and BM25. I am new to IR.

For example, the simplest form of language model simply throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model:

P_{uni}(t_1t_2t_3t_4) = P(t_1)P(t_2)P(t_3)P(t_4)
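
As a rough illustration of the unigram formula above, the following Python sketch estimates each P(t) by maximum likelihood from a toy corpus and multiplies the per-term probabilities. The toy corpus, the function names, and the absence of smoothing are assumptions made purely for illustration.

```python
from collections import Counter

def train_unigram(tokens):
    """Maximum-likelihood estimate: P(t) = count(t) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def unigram_probability(model, sequence):
    """P_uni(t_1 ... t_n) = product of P(t_i); unseen terms get probability 0."""
    prob = 1.0
    for term in sequence:
        prob *= model.get(term, 0.0)
    return prob

# Hypothetical toy corpus, used only to exercise the functions.
uni_corpus = "the cat sat on the mat".split()
uni_model = train_unigram(uni_corpus)
p_uni = unigram_probability(uni_model, ["the", "cat"])  # (2/6) * (1/6)
```

Because a single unseen term drives the whole product to zero, practical retrieval models normally add some form of smoothing.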

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term:

P_{bi}(t_1t_2t_3t_4) = P(t_1)P(t_2\vert t_1)P(t_3\vert t_2)P(t_4\vert t_3)
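
The bigram formula can be sketched the same way. In this minimal example, P(t_i | t_{i-1}) is approximated as count(t_{i-1}, t_i) / count(t_{i-1}), with no smoothing. The function names and the toy corpus are assumptions for illustration only.

```python
from collections import Counter

def train_bigram(tokens):
    """Collect unigram counts, adjacent-pair counts, and the corpus size."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(tokens)

def bigram_probability(model, sequence):
    """P_bi(t_1 ... t_n) = P(t_1) * product of P(t_i | t_{i-1})."""
    unigrams, bigrams, total = model
    if not sequence:
        return 1.0
    # The first term has no predecessor, so it uses the unigram estimate.
    prob = unigrams[sequence[0]] / total
    for prev, cur in zip(sequence, sequence[1:]):
        if unigrams[prev] == 0:
            return 0.0  # the conditioning term was never seen
        # Approximation: count(prev, cur) / count(prev)
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

# Hypothetical toy corpus, used only to exercise the functions.
bi_corpus = "the cat sat on the mat".split()
bi_model = train_bigram(bi_corpus)
p_bi = bigram_probability(bi_model, ["the", "cat", "sat"])  # (2/6) * (1/2) * (1/1)
```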

Recommended answer

Take a look at Whoosh's scoring module and use BM25F (lines 276 to 332) as a reference for building your own weighting and scoring models. You need to create a Weighting Model and a Scorer. Assuming you want to call your model Unigram, the main steps would be:

  1. Implement your own Unigram weighting model class and inherit from scoring.WeightingModel:

class Unigram(WeightingModel):
    pass  # define scorer() and any other required methods here

Implement the methods required by the base class. The main one is scorer(), which returns an instance of your scorer class (described next). The weighting model is what you hand to the searcher when you create it, and it determines how the searcher scores results.

  2. Implement a UnigramScorer class and inherit from scoring.WeightLengthScorer:

class UnigramScorer(WeightLengthScorer):
    pass  # define __init__ and _score here

Implement the __init__ and _score methods. __init__ takes the field name and value and is called once for each term in your query when you call searcher.search(). _score is called for each matching document in your results. It takes a weight and a length and returns a score for the given field.
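
To make the role of _score more concrete, here is a minimal, library-independent sketch of the arithmetic a unigram scorer could perform for one query term in one document. It uses Jelinek-Mercer smoothing, which is one common choice and not something the answer prescribes. The function name, the parameters collection_freq and collection_length, and the default lam=0.5 are all assumptions. In a real UnigramScorer, weight and length would be the arguments passed to _score, and the collection statistics would need to be gathered in __init__.

```python
import math

def unigram_term_score(weight, length, collection_freq, collection_length, lam=0.5):
    """Log of the smoothed probability of one query term under a document model.

    weight            -- frequency of the term in the document field
    length            -- number of terms in the document field
    collection_freq   -- frequency of the term in the whole collection (assumed > 0)
    collection_length -- total number of terms in the collection (assumed > 0)
    lam               -- interpolation weight given to the collection model
    """
    p_doc = weight / length if length > 0 else 0.0
    p_coll = collection_freq / collection_length
    # Jelinek-Mercer: interpolate the document and collection estimates.
    return math.log((1.0 - lam) * p_doc + lam * p_coll)

# Hypothetical numbers: the term occurs twice in a 10-term field
# and 5 times in a 100-term collection.
term_score = unigram_term_score(weight=2, length=10,
                                collection_freq=5, collection_length=100)
```

Log probabilities are negative, so it is worth checking how the search library treats negative scores before plugging this arithmetic into a scorer unchanged.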

When you create your searcher at search time, specify your custom language model using the weighting parameter:

ix.searcher(weighting=Unigram)
