Finding a single field's terms with Lucene (PyLucene)


Question

I'm fairly new to Lucene's term vectors and want to make sure my term gathering is as efficient as it can be. I'm getting the unique terms and then retrieving the docFreq() of each term to perform faceting.

I'm gathering all document terms from the index using:

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
terms = ireader.terms() #Returns TermEnum

This works fine, but is there a way to return only the terms for a specific field (across all documents)? Wouldn't that be more efficient?

For example:

 ireader.terms(Field="country")

Answer

IndexReader.terms() accepts an optional Term object. A Term is built from two arguments, the field name and the value, which Lucene calls the "term field" and the "term text".

By passing a Term with an empty string as its term text, we can start the term iteration at the first term of the field we are concerned with, because the enumeration is positioned at the first term greater than or equal to the one supplied.

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
# Seek to the first term of "field_name"; the empty term text sorts before any real value
terms = ireader.terms(Term("field_name", ""))
facets = {'other': 0}
# terms(Term) is already positioned on a term, so read it before calling next()
while True:
    term = terms.term()
    if term is None or term.field() != "field_name":  #We've got every value
        break
    print "Field Name:", term.field()
    print "Field Value:", term.text()
    print "Matching Docs:", int(ireader.docFreq(term))
    if not terms.next():
        break
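
To see why seeking to an empty term text works, it can help to model the term dictionary as a sorted list of (field, text) pairs. The sketch below is a toy stand-in written in plain Python, not Lucene's actual implementation; `TERM_DICT`, `field_terms`, and the sample data are all hypothetical. It seeks to `(field, "")` and stops as soon as the field changes, which mirrors the loop above.

```python
import bisect

# Toy stand-in for a term dictionary: (field, text) keys in sorted order,
# each mapped to a document frequency. Names and data are hypothetical.
TERM_DICT = sorted({
    ("city", "Berlin"): 2,
    ("city", "Paris"): 1,
    ("country", "France"): 1,
    ("country", "Germany"): 2,
    ("title", "report"): 3,
}.items())


def field_terms(term_dict, field):
    """Yield (text, doc_freq) pairs for one field, starting from a seek position."""
    keys = [key for key, _ in term_dict]
    # (field, "") sorts before every real term of `field`,
    # so bisect_left lands on that field's first term.
    start = bisect.bisect_left(keys, (field, ""))
    for (term_field, text), doc_freq in term_dict[start:]:
        if term_field != field:  # moved past the last term of the field
            break
        yield text, doc_freq


facets = dict(field_terms(TERM_DICT, "country"))
print(facets)
```

A field that has no terms simply yields nothing, because the seek lands on the next field and the loop exits on its first comparison.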

Hopefully others searching for how to perform faceting in PyLucene will come across this post. The key is indexing terms as-is. For completeness, this is how field values should be indexed.

dir = SimpleFSDirectory(File(indexdir))
analyzer = StandardAnalyzer(Version.LUCENE_30)
writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))
print "Currently there are %d documents in the index..." % writer.numDocs()
print "Adding %s Documents to Index..." % docs.count()
for val in terms:
    doc = Document()
    #Store the field, as-is, with term-vectors.
    doc.add(Field("field_name", val, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES))
    writer.addDocument(doc)

writer.optimize()
writer.close()
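
The reason as-is indexing matters for faceting can be illustrated without Lucene. The sketch below uses a deliberately crude `toy_analyze` function (lowercase, then split on whitespace) as a hypothetical stand-in for an analyzer; it is not StandardAnalyzer, and the sample values are made up. It compares the per-term document counts you would get from tokenized values against the counts from whole, untouched values.

```python
from collections import Counter

# Hypothetical raw field values, one per document.
values = ["United States", "United Kingdom", "United States"]


def toy_analyze(value):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return value.lower().split()


# Analyzed: every token becomes its own term; count each term once per document.
analyzed = Counter(tok for v in values for tok in set(toy_analyze(v)))

# Not analyzed: the whole value is a single term, which is what a facet needs.
as_is = Counter(values)

print(analyzed)
print(as_is)
```

With tokenized values, "united" turns into a facet of its own and the original values can no longer be recovered from the term list, which is why the field above is indexed with Field.Index.NOT_ANALYZED.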
