Term frequency of documents with NEST Elasticsearch
Problem description
I am new to Elasticsearch and want to get the top N terms by frequency in the "content" field of a specific document using NEST. I have searched a lot for an answer that works for me, but all I learned is that I should use term vectors rather than a terms facet, because a terms facet counts terms across the whole set of documents. I know I need some term vector settings like the ones below:
[ElasticProperty(Type = Nest.FieldType.attachment, TermVector =Nest.TermVectorOption.with_positions_offsets, Store = true)]
public Attachment File { get; set; }
I searched extensively for how to get the term frequency of a specific document using NEST, but everything I found was about Lucene and Solr. I need an example in NEST. I appreciate your help.
One more question: the solution suggested by Rob works well when I want the term frequency of a string field such as the title of my documents. But when I change the target field to the content of the documents, I get no results back. To be able to search document content, I followed the answer in this link: ElasticSearch & attachment type (NEST C#). It works fine and I can search for a term in the content of a document, but getting the term frequency does not work. Below is the code for it:
var searchResults = client.TermVector<Document>(t =>t.Id(ID).TermStatistics().Fields(f => f.File));
Does anyone have a solution for this?
Recommended answer
You can do this with client.TermVector(..). Here is a simple example:
Document class:
public class MyDocument
{
    public int Id { get; set; }
    [ElasticProperty(TermVector = TermVectorOption.WithPositionsOffsets)]
    public string Description { get; set; }
    [ElasticProperty(Type = FieldType.Attachment, TermVector = TermVectorOption.WithPositionsOffsetsPayloads, Store = true, Index = FieldIndexOption.Analyzed)]
    public Attachment File { get; set; }
}
Index some test data:
var indicesOperationResponse = client.CreateIndex(indexName, c => c
    .AddMapping<MyDocument>(m => m.MapFromAttributes()));
var myDocument = new MyDocument { Id = 1, Description = "test cat test" };
client.Index(myDocument);
client.Index(new MyDocument { Id = 2, Description = "river" });
client.Index(new MyDocument { Id = 3, Description = "test" });
client.Index(new MyDocument { Id = 4, Description = "river" });
client.Refresh();
Retrieve term statistics through NEST:
var termVectorResponse = client.TermVector<MyDocument>(t => t
    .Document(myDocument)
    //.Id(1) // you can specify the document by id as well
    .TermStatistics()
    .Fields(f => f.Description));
foreach (var item in termVectorResponse.TermVectors)
{
    Console.WriteLine("Field: {0}", item.Key);
    // Note: this ranks by TotalTermFrequency (occurrences across the whole index).
    // To rank by frequency within this document, order by x.Value.TermFrequency instead.
    var topTerms = item.Value.Terms.OrderByDescending(x => x.Value.TotalTermFrequency).Take(10);
    foreach (var term in topTerms)
    {
        Console.WriteLine("{0}: {1}", term.Key, term.Value.TermFrequency);
    }
}
Output:
Field: description
cat: 1
test: 2
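To make the two numbers involved here concrete, the following is a minimal sketch that computes them by hand for the same four test descriptions, without NEST or a running cluster. It assumes a naive lowercase-and-split-on-spaces tokenizer as a stand-in for the Elasticsearch analyzer, and the class and method names (TermFrequencySketch, CountTerms) are hypothetical, not part of any library. "tf" is the count inside document 1 only; "ttf" is the count across all documents, which is roughly what TermStatistics() adds to the response.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermFrequencySketch
{
    // Assumption: a naive tokenizer (lowercase, split on spaces) stands in
    // for the real Elasticsearch analyzer.
    public static Dictionary<string, int> CountTerms(string text)
    {
        return text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(t => t)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        // Same Description values as the test data indexed above.
        var descriptions = new[] { "test cat test", "river", "test", "river" };

        // Term frequency inside the first document only.
        var docTf = CountTerms(descriptions[0]);

        // Total term frequency across every document.
        var totalTf = CountTerms(string.Join(" ", descriptions));

        // Top N terms of the document, ranked by their frequency in that document.
        foreach (var term in docTf.OrderByDescending(kv => kv.Value).Take(10))
        {
            Console.WriteLine("{0}: tf={1}, ttf={2}", term.Key, term.Value, totalTf[term.Key]);
        }
    }
}
```

For document 1 this gives tf=2 and ttf=3 for "test", and tf=1 and ttf=1 for "cat", which shows why the choice of sort key matters when the goal is the top terms of a single document.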
Hope it helps.
UPDATE
When I checked the mapping for the index, one thing was interesting:
{
  "my_index" : {
    "mappings" : {
      "mydocument" : {
        "properties" : {
          "file" : {
            "type" : "attachment",
            "path" : "full",
            "fields" : {
              "file" : {
                "type" : "string"
              },
              "author" : {
                "type" : "string"
              },
              "title" : {
                "type" : "string"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "id" : {
            "type" : "integer"
          }
        }
      }
    }
  }
}
There is no information about term vectors in it.
When I created the mapping through Sense instead:
PUT http://localhost:9200/my_index/mydocument/_mapping
{
  "mydocument": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "term_vector": "with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}
I was able to retrieve term statistics.
I hope to be back later with a working mapping created through NEST.
UPDATE 2
Based on Greg's answer, try this fluent mapping:
var indicesOperationResponse = client.CreateIndex(indexName, c => c
    .AddMapping<MyDocument>(m => m
        .MapFromAttributes()
        .Properties(ps => ps
            .Attachment(s => s.Name(p => p.File)
                .FileField(ff => ff.Name(f => f.File).TermVector(TermVectorOption.WithPositionsOffsets)))))
);