Solr: Find word count for the 'text' field of an indexed PDF document


Question

I am trying to find the most frequent words in the text field of an indexed document using Solr 4.10. I created a PDF document from a text file with some text and posted it to Solr using post.jar; when queried by its id, it returns the PDF contents shown below, along with all the metadata of the document.

<arr name="text">
    <str>sample1</str>
    <str/>
    <str>application/pdf</str>
    <str>
    sample1 sample1.txt cook cook1 book1 book1 book2 nook1 nook1 nook2 nook2 two three four Page 1
    </str>
</arr>

In summary, I want to detect that we have cook and cook1 with count 1 each, and book1, book2, nook1, nook2 with count 2 each.

I used the TermVectorComponent configuration from the Solr TermVectorComponent documentation, and my schema.xml has the text field:

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

solrconfig.xml has:

<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>

  <requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="df">text</str>
      <bool name="tv">true</bool>
    </lst>
    <arr name="last-components">
      <str>tvComponent</str>
    </arr>
  </requestHandler>

The field type 'text_general' is defined as:

<fieldType class="solr.TextField" name="text_general" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Finally, I query it from the browser using the following query, which I believe requests the word count in the 'text' field of the document with the given id:

http://localhost:8983/solr/select/?q=id:7e75017b-066d-4257-af10-b770726c7cf4&start=0&rows=100&indent=on&qt=tvrh&tv=true&tv.fl=text&f.text.tv.tf=true&tv.fl=text
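For context, when the term-vector component runs and f.text.tv.tf=true is honored, Solr 4.x appends a termVectors section to the response, roughly of this shape (an illustrative sketch only; the term names and tf values below follow the sample text above, and the exact nesting can vary between 4.x versions):

```xml
<lst name="termVectors">
  <lst name="7e75017b-066d-4257-af10-b770726c7cf4">
    <lst name="text">
      <lst name="book1">
        <int name="tf">2</int>
      </lst>
      <lst name="cook">
        <int name="tf">1</int>
      </lst>
      <!-- ...one entry per term in the field... -->
    </lst>
  </lst>
</lst>
```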

it returns all the information of the document response except the word count. I only want to see the word count in the 'text' field, just like the response we obtain when we use rows=0 for faceting, i.e. a string array of word vs. count.

Any help would be appreciated.

NOTE: I am trying to get the word frequency for the 'text' field of one document, not for the 'text' field of all indexed documents. In other words, how can I ask Solr to avoid throwing away duplicate tokens (or duplicate stemmed tokens) so that we can search for the most frequent words in a field?

Answer

You don't need to use the terms component for this. If you are tokenizing the text field, you should be able to facet on the field easily, like so:

http://localhost:8983/solr/select/?q=id:7e75017b-066d-4257-af10-b770726c7cf4&facet=true&facet.field=text&facet.mincount=1
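The facet section of the response has this general shape (a sketch of the Solr XML response format; the term names come from the sample text above, and the counts shown are illustrative of the expected result rather than actual output):

```xml
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="text">
      <!-- one <int name="term">count</int> entry per token, e.g.: -->
      <int name="book1">2</int>
      <int name="nook1">2</int>
      <int name="cook">1</int>
    </lst>
  </lst>
</lst>
```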

This will give you a list of tokens (words) in that text field, sorted by frequency of occurrence. You can also tweak the facet parameters, such as facet.limit, to suit your needs.

Keep in mind that this counts the tokens in that field, so make sure you review the field's analyzers/filters to verify you are getting the correct results, since different filters generate tokens differently.

For an exact word count, tokenizing on whitespace plus basic stemming will probably get you where you need to be.
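As a sketch of that suggestion, a field type like the following (hypothetical; not part of the original schema) splits only on whitespace, so tokens like cook1 survive intact, and applies a basic stemmer so inflected forms are counted together:

```xml
<fieldType name="text_wordcount" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split only on whitespace so tokens like "cook1" are not broken up -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- basic stemming so e.g. "books" and "book" are counted as one word -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

As with the text_general type above, the field using it would need termVectors="true" (or faceting) to expose the counts.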
