Lucene, indexing already/externally tokenized tokens and defining own analyzing process
In the process of using Lucene, I am a bit disappointed: I do not see how I should feed a Lucene analyzer with something that is already tokenized and directly indexable, or how I should create my own analyzer to do so...
For example, suppose I have a List<MyCustomToken> that already contains many tokens (and actually much more information, about capitalization etc., that I would also like to index as features on each MyCustomToken).
If I understand correctly what I have read, I need to subclass Analyzer, which will call my own tokenizer (a TokenStream subclass), where I only have to provide a public final boolean incrementToken() that does the job of inserting a TermAttribute at each position.
BTW, here is where I am confused => such a tokenizer is constructed from a java.io.Reader, and thus seems only capable of analyzing stream-like objects such as a file or a string...
How can I proceed to have my own document analyzer that will consume my List rather than this streamed input?

It looks like the whole Lucene API is built on the idea that analysis starts at the very low level of characters, while I need to plug in later, with already tokenized words or even expressions (groups of words).
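For what it's worth, a TokenStream does not have to read characters at all — it only has to populate attributes on each incrementToken() call. Below is a minimal sketch (Lucene 3.x API) of a stream that replays a pre-tokenized list; MyCustomToken and its getText()/getStart()/getEnd() accessors are hypothetical stand-ins for whatever the external pipeline produces:

```java
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Sketch: a TokenStream that replays an already-tokenized list instead of
// reading characters from a Reader.
public final class ListTokenStream extends TokenStream {
    private final Iterator<MyCustomToken> tokens;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    public ListTokenStream(List<MyCustomToken> tokens) {
        this.tokens = tokens.iterator();
    }

    @Override
    public boolean incrementToken() {
        if (!tokens.hasNext()) {
            return false; // end of stream
        }
        clearAttributes(); // required before populating attributes
        MyCustomToken t = tokens.next();
        termAtt.setEmpty().append(t.getText());        // hypothetical accessor
        posIncrAtt.setPositionIncrement(1);
        offsetAtt.setOffset(t.getStart(), t.getEnd()); // hypothetical accessors
        return true;
    }
}
```

The key point is that the Reader only matters for Tokenizer subclasses; a plain TokenStream can draw its tokens from any source, including an in-memory list.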
Typical samples of Lucene usage look like this (taken from here):
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action"); // BUT here i would like to have a addDoc(w, MyOwnObject)
addDoc(w, "Lucene for Dummies");
addDoc(w, "Managing Gigabytes");
addDoc(w, "The Art of Computer Science");
w.close();
[...]
private static void addDoc(IndexWriter w, String value) throws IOException {
Document doc = new Document();
doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
// SO that i can add here my own analysis base on many fields, with them built from a walk through List or complex structures...
w.addDocument(doc);
}
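One way to get the addDoc(w, MyOwnObject) shape I am after appears to be the Field(String, TokenStream) constructor, which bypasses the configured Analyzer entirely (such fields are indexed but cannot be stored). A hedged sketch, assuming a hypothetical ListTokenStream built over the pre-tokenized list:

```java
// Variant of addDoc that feeds a pre-built TokenStream straight to the
// index; the IndexWriter consumes the stream as-is, without re-analysis.
// ListTokenStream is a hypothetical TokenStream over List<MyCustomToken>.
private static void addDoc(IndexWriter w, List<MyCustomToken> tokens) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", new ListTokenStream(tokens)));
    w.addDocument(doc);
}
```

With this constructor the analyzer passed to IndexWriterConfig is never consulted for that field, so the external tokenization is preserved exactly.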
PS: my Java/Lucene knowledge is still very poor, so I may have missed something obvious about the Reader <=> List pattern?

This question is almost the same as mine on the Lucene list.
EDIT: @Jilles van Gurp => yes, you are quite right, and that was another issue I had thought of, but I first hoped to find a more elegant solution. So, continuing down that road, I could still do some kind of serialization, feed the serialized string as a document to my own analyzer, whose tokenizer would then deserialize it and redo some basic tokenization (actually, just walking through the one already done...). BTW, that adds some slower and sillier extra steps that I would have liked to avoid...
About this part => does someone have a sample of a recent (Lucene > 3.6) custom tokenizer providing all the underlying data necessary for a Lucene index? I have read about emitting a Token like this:
posIncrement.setPositionIncrement(increment);
char[] asCharArray = myAlreadyTokenizedString.toCharArray(); // here is my workaround
termAttribute.copyBuffer(asCharArray, 0, asCharArray.length);
//termAttribute.setTermBuffer(kept);
position++;
As for the why-am-I-here part: I use some external libraries that tokenize my texts, do part-of-speech annotation, and other analyses (one may think of expression recognition or named-entity recognition, possibly including special features about capitalization, etc.) that I would like to keep track of in a Lucene index (the part that really interests me is indexing and querying, not the first step of analysis, which as far as Lucene is concerned is, from what I have read, mostly just tokenizing).

(Also, I do not think I can do something smarter in those previous/early steps, since I use many different tools, and not all of them are in Java or can easily be wrapped in Java.)

So I find it a bit sad that Lucene, which aims at working with text, is so bound to words/tokens (sequences of characters), while text is much more than a mere juxtaposition of single/isolated words/tokens...
Instead of trying to implement something like addDoc(w, MyOwnObject), could you use MyOwnObject.toString() and implement an @Override String toString() in your MyOwnObject class?
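A minimal sketch of that suggestion: flatten MyOwnObject back into a String so the existing addDoc(IndexWriter, String) can consume it. The single token-list field and space-joining below are illustrative only; a real version would have to encode (and later re-parse) the extra features too, which is exactly the round-trip the asker wanted to avoid.

```java
import java.util.List;

// Illustrative MyOwnObject whose toString() re-serializes the tokens.
class MyOwnObject {
    private final List<String> tokens;

    MyOwnObject(List<String> tokens) {
        this.tokens = tokens;
    }

    @Override
    public String toString() {
        // Join the pre-tokenized text with spaces; the analyzer will then
        // re-tokenize it on indexing.
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }
}
```

This keeps the IndexWriter-facing code unchanged at the cost of a deserialize/re-tokenize step.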