Lucene, indexing already/externally tokenized tokens and defining own analyzing process


Problem description



In the process of using Lucene, I am a bit disappointed. I do not see or understand how I should proceed to feed a Lucene analyzer with something that is already directly indexable, or how I should proceed to create my own analyzer...

For example, suppose I have a List<MyCustomToken> that already contains many tokens (and actually much more information about capitalization, etc., that I would also like to index as features on each MyCustomToken).

If I understand what I have read correctly, I need to subclass Analyzer so that it calls my own tokenizer, itself a subclass of TokenStream, where I only have to provide a public final boolean incrementToken() that does the job of setting a TermAttribute at each position.
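For reference, here is a minimal sketch (my assumption, using the Lucene 3.x attribute API) of what such a TokenStream subclass could look like when it walks an existing List instead of reading characters; MyCustomToken and its text() accessor are hypothetical stand-ins:

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// A TokenStream that emits tokens from a pre-tokenized list
// rather than from a Reader.
public final class ListTokenStream extends TokenStream {
  private final Iterator<MyCustomToken> tokens;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);

  public ListTokenStream(List<MyCustomToken> tokens) {
    this.tokens = tokens.iterator();
  }

  @Override
  public boolean incrementToken() {
    if (!tokens.hasNext()) {
      return false; // no more tokens; indexing consumes the stream once
    }
    clearAttributes();
    MyCustomToken token = tokens.next();
    termAtt.setEmpty().append(token.text()); // the term text itself
    posIncrAtt.setPositionIncrement(1);      // one position step per token
    return true;
  }
}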

BTW, here is where I am confused => such a Tokenizer is fed from a java.io.Reader, and is thus only capable of analyzing a stream-like source such as a file or a string...

How can I proceed to have my own document analyzer that consumes my List rather than a stream?

It looks like the whole Lucene API is built on the idea that analysis starts at a very low level, from a character point of view, while I need to plug in later, with already tokenized words or even expressions (groups of words).

Typical samples of Lucene usage look like this (taken from here):

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action");   // BUT here I would like to have addDoc(w, MyOwnObject)
addDoc(w, "Lucene for Dummies");
addDoc(w, "Managing Gigabytes");
addDoc(w, "The Art of Computer Science");
w.close();

[...]   

private static void addDoc(IndexWriter w, String value) throws IOException {
  Document doc = new Document();
  doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
  // HERE I would like to plug in my own analysis, based on many fields built by walking through a List or more complex structures...
  w.addDocument(doc);
}
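As a hedged aside: Lucene 3.x also has a Field constructor that accepts a TokenStream directly, bypassing the IndexWriter's analyzer for that field. Assuming the ListTokenStream sketched earlier and a hypothetical tokens() accessor on MyOwnObject, an addDoc(w, MyOwnObject) overload could look like this:

private static void addDoc(IndexWriter w, MyOwnObject value) throws IOException {
  Document doc = new Document();
  // Field(String, TokenStream) indexes the pre-built stream as-is;
  // such a field is indexed/tokenized but cannot be stored.
  doc.add(new Field("title", new ListTokenStream(value.tokens())));
  w.addDocument(doc);
}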


PS: (my Java/Lucene knowledge is still very poor, so I may have missed something obvious about the Reader <=> List pattern?)

This question is almost the same as the one I posted on the Lucene list.

EDIT: @Jilles van Gurp => yes, you are quite right, and that is another issue I had thought of, but I first hope to find a more elegant solution. If I go down that road, I can still do some kind of serialization, feed the serialized string as a document to my own analyzer, whose own tokenizer will then deserialize it and re-do some basic tokenization (actually, just walking through the one already done...). BTW, that adds some slow and redundant extra steps that I would have liked to avoid...

About this part => does anyone have a sample of a recent (Lucene > 3.6) custom tokenizer providing all the underlying data necessary for a Lucene index? I have read about emitting tokens like this:

        posIncrement.setPositionIncrement(increment); 
        char[] asCharArray = myAlreadyTokenizedString.toCharArray(); // here is my workaround 
        termAttribute.copyBuffer(asCharArray, 0, asCharArray.length); 
        //termAttribute.setTermBuffer(kept); 
        position++; 
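For context, the posIncrement and termAttribute objects used in that fragment would normally be obtained once, when the stream is constructed; a hedged reconstruction of the missing setup (Lucene 3.x attribute API):

// typically declared as fields of the TokenStream subclass:
private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
private final PositionIncrementAttribute posIncrement =
    addAttribute(PositionIncrementAttribute.class);

// and each call to incrementToken() should begin with:
clearAttributes();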

As for the why-am-I-here part: I use some external libraries that tokenize my texts, do part-of-speech annotation, and other analyses (think expression recognition or named-entity recognition; they can also include special features about capitalization, etc.) that I would like to keep track of in a Lucene index (the parts that really interest me are indexing and querying, not the first analysis step, which from what I have read is, in the Lucene library, almost only tokenization).
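As an aside, one hedged option for carrying such per-token features (POS tags, capitalization flags, named-entity labels, ...) into the index is Lucene's payload mechanism: in 3.x a PayloadAttribute can attach a small byte array to each token position. A sketch, where encodeFeatures() is a hypothetical helper that packs the annotations into bytes:

import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// inside the TokenStream subclass:
private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

// ... in incrementToken(), next to the term and position attributes:
byte[] features = encodeFeatures(token); // hypothetical: POS tag, capitalization, ...
payloadAtt.setPayload(new Payload(features));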

(Also, I do not think I can do anything smarter in those previous/early steps, as I use many different tools, and not all of them are Java or can easily be wrapped in Java.)

So I find it a bit sad that Lucene, which is aimed at working with text, is so bound to words/tokens (sequences of chars), while text is much more than a juxtaposition of single/isolated words/tokens...

Solution

Instead of trying to implement something like addDoc(w, MyOwnObject), could you use MyOwnObject.toString() and implement an @Override String toString() in your MyOwnObject class?
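If that route is taken, a minimal sketch of the idea (MyCustomToken and its text() accessor are hypothetical, and note that the extra annotations are lost in the round trip):

public class MyOwnObject {
  private final List<MyCustomToken> tokens; // pre-tokenized content

  public MyOwnObject(List<MyCustomToken> tokens) {
    this.tokens = tokens;
  }

  @Override
  public String toString() {
    // re-serialize the tokens into plain text so the existing
    // addDoc(w, String) and a standard analyzer can consume it
    StringBuilder sb = new StringBuilder();
    for (MyCustomToken token : tokens) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(token.text());
    }
    return sb.toString();
  }
}

addDoc(w, myOwnObject.toString()) then works with the existing signature, at the cost of the re-tokenization step discussed in the EDIT above.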
