Term-document matrix in Lucene
Question
I am trying to get a term-document matrix from Lucene. Most of the SO questions on this topic seem to target outdated APIs with different classes. I tried combining insights from these two questions to get a term vector for every document:
The relevant code is below, but DocsEnum is not recognized in the current API. How can I get a term vector, or a count of all terms, for every document?
IndexReader reader = DirectoryReader.open(index);
for (int i = 0; i < reader.maxDoc(); i++) {
    Document doc = reader.document(i);
    Terms terms = reader.getTermVector(i, "country_text");
    if (terms != null && terms.size() > 0) {
        // access the terms for this field
        TermsEnum termsEnum = terms.iterator();
        BytesRef term = null;
        // explore the terms for this field
        while ((term = termsEnum.next()) != null) {
            // enumerate through documents, in this case only one
            DocsEnum docsEnum = termsEnum.docs(null, null);
            int docIdEnum;
            while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                // get the term frequency in the document
                System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
            }
        }
    }
}
Full code:
import java.io.*;
import java.util.Iterator;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.json.simple.parser.JSONParser;
public class LuceneIndex {
    public static void main(String[] args) throws IOException, ParseException {
        String jsonFilePath = "wiki_data.json";
        JSONParser parser = new JSONParser();
        // Specify the analyzer for tokenizing text.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter w = new IndexWriter(index, config);
        try {
            JSONArray a = (JSONArray) parser.parse(new FileReader(jsonFilePath));
            for (Object o : a) {
                JSONObject country = (JSONObject) o;
                String countryName = (String) country.get("country_name");
                String cityName = (String) country.get("city_name");
                String countryText = (String) country.get("country_text");
                String cityText = (String) country.get("city_text");
                System.out.println(cityName);
                addDoc(w, countryName, cityName, countryText, cityText);
            }
            w.close();
            IndexReader reader = DirectoryReader.open(index);
            for (int i = 0; i < reader.maxDoc(); i++) {
                Document doc = reader.document(i);
                Terms terms = reader.getTermVector(i, "country_text");
                if (terms != null && terms.size() > 0) {
                    // access the terms for this field
                    TermsEnum termsEnum = terms.iterator();
                    BytesRef term = null;
                    // explore the terms for this field
                    while ((term = termsEnum.next()) != null) {
                        // enumerate through documents, in this case only one
                        DocsEnum docsEnum = termsEnum.docs(null, null);
                        int docIdEnum;
                        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                            // get the term frequency in the document
                            System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
                        }
                    }
                }
            }
            // reader can be closed when there
            // is no need to access the documents any more.
            reader.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (org.json.simple.parser.ParseException e) {
            e.printStackTrace();
        }
    }

    private static void addDoc(IndexWriter w, String countryName, String cityName,
                               String countryText, String cityText) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("country_name", countryName, Field.Store.YES));
        doc.add(new StringField("city_name", cityName, Field.Store.YES));
        doc.add(new TextField("country_text", countryText, Field.Store.YES));
        doc.add(new TextField("city_text", cityText, Field.Store.YES));
        w.addDocument(doc);
    }
}
Recommended answer
First, thank you for your code: I had a small bug of my own, and your code helped me fix it.
For me, it works with this (Lucene 7.2.1):
for (int i = 0; i < reader.maxDoc(); i++) {
    Document doc = reader.document(i);
    Terms terms = reader.getTermVector(i, "text");
    if (terms != null && terms.size() > 0) {
        // access the terms for this field
        TermsEnum termsEnum = terms.iterator();
        BytesRef term = null;
        // explore the terms for this field
        while ((term = termsEnum.next()) != null) {
            // enumerate through documents, in this case only one
            PostingsEnum docsEnum = termsEnum.postings(null);
            int docIdEnum;
            while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                // get the term frequency in the document
                System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
            }
        }
    }
}
The change here is that I used PostingsEnum. DocsEnum is no longer available in Lucene 7.2.1.
But the reason it didn't work for you is the way you add your documents:
private void addDoc(IndexWriter w, String text, String name, String id) throws IOException {
    Document doc = new Document();
    // Create own FieldType to store Term Vectors
    FieldType ft = new FieldType();
    ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    ft.setTokenized(true);
    ft.setStored(true);
    ft.setStoreTermVectors(true); // Store Term Vectors
    ft.freeze();
    StoredField t = new StoredField("text", text, ft);
    doc.add(t);
    doc.add(new StringField("name", name, Field.Store.YES));
    doc.add(new StringField("id", id, Field.Store.YES));
    w.addDocument(doc);
}
You have to create your own FieldType. None of the standard ones will store the term vectors.
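The loop in this answer only prints one term, document, and frequency per line. To turn that into an actual term-document matrix, those triples have to be collected somewhere. Below is a minimal sketch in plain Java, with no Lucene dependency, of one way to do that. The class TermDocumentMatrix and its methods add, terms, and toDense are hypothetical names made up for this illustration; they are not part of Lucene. The sketch assumes you call add(term.utf8ToString(), i, docsEnum.freq()) inside the inner loop, using the outer loop index i as the document id. As far as I can tell, the id returned by the postings enumeration of a term vector is relative to that single-document vector, so it is not used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not a Lucene class): accumulates (term, docId, freq)
// triples and exposes them as a dense term-document matrix.
class TermDocumentMatrix {
    // term -> (docId -> frequency); TreeMap keeps the terms sorted
    private final Map<String, Map<Integer, Integer>> counts = new TreeMap<>();
    private int numDocs = 0;

    // Record that `term` occurs `freq` times in document `docId`.
    // Repeated calls for the same pair are summed.
    public void add(String term, int docId, int freq) {
        counts.computeIfAbsent(term, t -> new TreeMap<>())
              .merge(docId, freq, Integer::sum);
        numDocs = Math.max(numDocs, docId + 1);
    }

    // Row labels of the matrix, in sorted order.
    public List<String> terms() {
        return new ArrayList<>(counts.keySet());
    }

    // One row per term (same order as terms()), one column per document id.
    public int[][] toDense() {
        int[][] matrix = new int[counts.size()][numDocs];
        int row = 0;
        for (Map<Integer, Integer> perDoc : counts.values()) {
            for (Map.Entry<Integer, Integer> e : perDoc.entrySet()) {
                matrix[row][e.getKey()] = e.getValue();
            }
            row++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        TermDocumentMatrix tdm = new TermDocumentMatrix();
        // Made-up sample data standing in for the values printed by the loop.
        tdm.add("paris", 0, 2);
        tdm.add("france", 0, 1);
        tdm.add("france", 1, 3);

        int[][] m = tdm.toDense();
        List<String> terms = tdm.terms();
        for (int r = 0; r < m.length; r++) {
            StringBuilder sb = new StringBuilder(terms.get(r));
            for (int c = 0; c < m[r].length; c++) {
                sb.append(' ').append(m[r][c]);
            }
            System.out.println(sb);
        }
    }
}
```

A dense int[][] is only reasonable for small collections. For a larger index, keeping the nested map (or another sparse representation) is probably the better choice, since most entries of a term-document matrix are zero.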