Lucene, indexing already/externally tokenized tokens and defining own analyzing process
In the process of using Lucene, I am a bit disappointed: I do not see how I should feed a Lucene analyzer with something that is already tokenized and directly indexable, or how I should create my own analyzer to do so...
For example, suppose I have a List<MyCustomToken> that already contains many tokens (and actually much more information, about capitalization etc., that I would also like to index as features on each MyCustomToken).
If I understand correctly what I have read, I need to subclass Analyzer, which will call my own tokenizer (a TokenStream subclass), where I only have to provide a public final boolean incrementToken() that does the job of inserting a TermAttribute at each position.
BTW, here is where I am confused => such a tokenizer is constructed from a java.io.Reader, and thus seems only capable of analyzing stream-like objects such as a file or a string...
How can I proceed to have my own document analyzer that will consume my List rather than this streamed input?

It looks like the whole Lucene API is built on the idea that analysis starts at the very low level of characters, while I need to plug in later, with already tokenized words or even expressions (groups of words).
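For what it's worth, a TokenStream does not have to read characters at all — it only has to populate attributes on each incrementToken() call. Below is a minimal sketch (Lucene 3.x API) of a stream that replays a pre-tokenized list; MyCustomToken and its getText()/getStart()/getEnd() accessors are hypothetical stand-ins for whatever the external pipeline produces:

```java
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Sketch: a TokenStream that replays an already-tokenized list instead of
// reading characters from a Reader.
public final class ListTokenStream extends TokenStream {
    private final Iterator<MyCustomToken> tokens;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    public ListTokenStream(List<MyCustomToken> tokens) {
        this.tokens = tokens.iterator();
    }

    @Override
    public boolean incrementToken() {
        if (!tokens.hasNext()) {
            return false; // end of stream
        }
        clearAttributes(); // required before populating attributes
        MyCustomToken t = tokens.next();
        termAtt.setEmpty().append(t.getText());        // hypothetical accessor
        posIncrAtt.setPositionIncrement(1);
        offsetAtt.setOffset(t.getStart(), t.getEnd()); // hypothetical accessors
        return true;
    }
}
```

The key point is that the Reader only matters for Tokenizer subclasses; a plain TokenStream can draw its tokens from any source, including an in-memory list.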
Typical samples of Lucene usage look like this (taken from here):
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action"); // BUT here i would like to have a addDoc(w, MyOwnObject)
addDoc(w, "Lucene for Dummies");
addDoc(w, "Managing Gigabytes");
addDoc(w, "The Art of Computer Science");
w.close();
[...]
private static void addDoc(IndexWriter w, String value) throws IOException {
Document doc = new Document();
doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
// SO that i can add here my own analysis base on many fields, with them built from a walk through List or complex structures...
w.addDocument(doc);
}
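One way to get the addDoc(w, MyOwnObject) shape I am after appears to be the Field(String, TokenStream) constructor, which bypasses the configured Analyzer entirely (such fields are indexed but cannot be stored). A hedged sketch, assuming a hypothetical ListTokenStream built over the pre-tokenized list:

```java
// Variant of addDoc that feeds a pre-built TokenStream straight to the
// index; the IndexWriter consumes the stream as-is, without re-analysis.
// ListTokenStream is a hypothetical TokenStream over List<MyCustomToken>.
private static void addDoc(IndexWriter w, List<MyCustomToken> tokens) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", new ListTokenStream(tokens)));
    w.addDocument(doc);
}
```

With this constructor the analyzer passed to IndexWriterConfig is never consulted for that field, so the external tokenization is preserved exactly.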
PS: my Java/Lucene knowledge is still very poor, so I may have missed something obvious about the Reader <=> List pattern?

This question is almost the same as mine on the Lucene list.
EDIT: @Jilles van Gurp => yes, you are quite right, and that was another issue I had thought of, but I first hoped to find a more elegant solution. So, continuing down that road, I could still do some kind of serialization, feed the serialized string as a document to my own analyzer, whose tokenizer would then deserialize it and redo some basic tokenization (actually, just walking through the one already done...). BTW, that adds some slower and sillier extra steps that I would have liked to avoid...
About this part => does someone have a sample of a recent (Lucene > 3.6) custom tokenizer providing all the underlying data necessary for a Lucene index? I have read about emitting a Token like this:
posIncrement.setPositionIncrement(increment);
char[] asCharArray = myAlreadyTokenizedString.toCharArray(); // here is my workaround
termAttribute.copyBuffer(asCharArray, 0, asCharArray.length);
//termAttribute.setTermBuffer(kept);
position++;
As for the why-am-I-here part: I use some external libraries that tokenize my texts, do part-of-speech annotation, and other analyses (one may think of expression recognition or named-entity recognition, possibly including special features about capitalization, etc.) that I would like to keep track of in a Lucene index (the part that really interests me is indexing and querying, not the first step of analysis, which as far as Lucene is concerned is, from what I have read, mostly just tokenizing).

(Also, I do not think I can do something smarter in those previous/early steps, since I use many different tools, and not all of them are in Java or can easily be wrapped in Java.)

So I find it a bit sad that Lucene, which aims at working with text, is so bound to words/tokens (sequences of characters), while text is much more than a mere juxtaposition of single/isolated words/tokens...
Instead of trying to implement something like addDoc(w, MyOwnObject), could you use MyOwnObject.toString() and implement an @Override String toString() in your MyOwnObject class?
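A minimal sketch of that suggestion: flatten MyOwnObject back into a String so the existing addDoc(IndexWriter, String) can consume it. The single token-list field and space-joining below are illustrative only; a real version would have to encode (and later re-parse) the extra features too, which is exactly the round-trip the asker wanted to avoid.

```java
import java.util.List;

// Illustrative MyOwnObject whose toString() re-serializes the tokens.
class MyOwnObject {
    private final List<String> tokens;

    MyOwnObject(List<String> tokens) {
        this.tokens = tokens;
    }

    @Override
    public String toString() {
        // Join the pre-tokenized text with spaces; the analyzer will then
        // re-tokenize it on indexing.
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }
}
```

This keeps the IndexWriter-facing code unchanged at the cost of a deserialize/re-tokenize step.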