Apache Lucene: how to convert a collection index to another format?


Question

I need to convert an index generated by Apache Lucene into another representation of the collection.

I currently have a collection of documents, each with many attributes.

From it, I need to create document pairs with similarity measures so that I can pass them to classifiers.

Do you know of any tutorial I could use to do this?

Thanks.

Recommended answer

The similarity measures need to be based on a query. That is, you query your Lucene document set and get back a set of documents with relative scores.

If you want to compare every document with every other document (is that right? It's hard to tell from the question), then you need to use a feature of each document as the basis for the queries.

For example, you could extract the top N terms (by frequency, excluding stop words) from each document. If you have X documents, you will then have X queries. You execute each of those X queries against the index and get back the relative similarity of each document to every other document. The result is an X-by-X matrix you could use for classification.
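The top-N-terms approach can be sketched roughly as follows. This is a plain-Python illustration of the idea, not Lucene code: the stop-word list, the function names, and the scoring function (the share of a document's tokens that match a query term) are all assumptions made for the example, and Lucene's own scoring would give different numbers.

```python
from collections import Counter
import re

# Illustrative stop-word list (an assumption); a real setup would reuse
# the stop words of the analyzer that built the index.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def top_terms(text, n):
    """Return the n most frequent non-stop-word terms of one document."""
    return [term for term, _ in Counter(tokenize(text)).most_common(n)]

def score(query_terms, text):
    """Stand-in for a search score: share of the document's tokens
    that match one of the query terms."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in query_terms) / len(tokens)

def similarity_matrix(docs, n):
    """Row i holds the scores of every document against the query
    built from document i, so X documents give an X-by-X matrix."""
    queries = [set(top_terms(d, n)) for d in docs]
    return [[score(q, d) for d in docs] for q in queries]

docs = [
    "lucene index search lucene search query",
    "search engine query ranking search query",
    "banana apple fruit banana apple",
]
matrix = similarity_matrix(docs, n=2)
for row in matrix:
    print([round(value, 2) for value in row])
```

Note that a matrix built this way is generally not symmetric: the score of document j for the query derived from document i need not equal the score in the other direction, so a classifier may want both values as features.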

Another alternative would be to use the title or synopsis of each document as the basis for the query (again, excluding stop words).
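As a rough sketch of the title-based variant, the snippet below turns a small collection into document pairs, each carrying two similarity values that could be fed to a classifier. It is plain Python rather than Lucene code; the `(title, body)` tuple layout, the helper names, the stop-word list, and the score (the share of title terms found in the other document's body) are assumptions chosen for illustration.

```python
from itertools import combinations
import re

# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}

def terms(text):
    """Set of lowercase alphabetic terms with stop words removed."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}

def title_score(title, body):
    """Stand-in score: share of the title's terms that occur in the body."""
    query = terms(title)
    if not query:
        return 0.0
    return len(query & terms(body)) / len(query)

def document_pairs(docs):
    """docs is a list of (title, body) tuples. Returns one row per
    unordered pair: (i, j, score of title i on body j, score of title j on body i)."""
    rows = []
    for i, j in combinations(range(len(docs)), 2):
        rows.append((
            i,
            j,
            title_score(docs[i][0], docs[j][1]),
            title_score(docs[j][0], docs[i][1]),
        ))
    return rows

docs = [
    ("Indexing with Lucene", "Lucene builds an inverted index for search."),
    ("The Search Index", "A search index maps terms to documents; Lucene is one library."),
    ("Baking Bread", "Flour, water and yeast make bread."),
]
pairs = document_pairs(docs)
for row in pairs:
    print(row)
```

Each row is one document pair with its similarity features, which is the shape the question asks for; with a real index, the two scores would come from running each title as a query and reading the other document's score from the results.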
