Building a tag cloud with Solr


Question

Dear Stack Overflow community,

Given some text, I want to get the 50 most frequent words in it and create a tag cloud from them, showing the gist of what the text is about in a graphical way.

The text is actually a set of 100 or so comments per item (a picture), and there are about 120 items. I also want to keep the cloud up to date by keeping the comments indexed and running the cloud generation code each time a new web request comes in.

I settled on using Solr to index the text, and I am now wondering how to get the top 50 words out of Solr's TermVectorComponent. Here is an example of the results returned by the term vector component after turning on term frequency with tv.tf=true:

  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>    
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>

  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>

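As you can see, I have two problems:


  1. I get all the terms in the document for that field, not just the top 100.

  2. They are not sorted by frequency, so I have to fetch the terms and sort them in memory for what I am trying to do on the fly.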

Is there a better way? Can I tell the Solr term vector component to sort the terms and return only the top 100 for me? Or is there some other framework I can use? I need to keep new comments indexed as they arrive so that the tag cloud is always up to date. As for the cloud generator, it takes a dictionary of weighted words and turns it into a nice image.
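For the in-memory route described above, a minimal Python sketch follows. It assumes the response fragment has exactly the shape shown in the example (a per-document lst element containing one lst per field, with tf children); the function name top_terms and the SAMPLE constant are hypothetical, not part of Solr.

```python
import heapq
import xml.etree.ElementTree as ET

# One document entry, copied from the term vector output shown above.
SAMPLE = """<lst name="doc-5">
  <str name="uniqueKey">MA147LL/A</str>
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="earbud"><tf>3</tf></lst>
    <lst name="headphon"><tf>10</tf></lst>
    <lst name="usb"><tf>11</tf></lst>
  </lst>
</lst>"""


def top_terms(doc_xml, field, n):
    """Return the n (term, tf) pairs with the highest tf for one document."""
    doc = ET.fromstring(doc_xml)
    pairs = []
    for field_node in doc.findall("lst"):
        if field_node.get("name") != field:
            continue
        for term_node in field_node.findall("lst"):
            tf = term_node.find("tf")
            if tf is not None:
                pairs.append((term_node.get("name"), int(tf.text)))
    # nlargest avoids sorting the full term list when n is small.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


print(top_terms(SAMPLE, "includes", 2))  # [('usb', 11), ('headphon', 10)]
```

Passing dict(top_terms(...)) to the cloud generator would give it the dictionary of weighted words it expects.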

This answer did not help.

EDIT - Trying out jpountz's and Paige Cook's answers

Here is the result I got for this query:

    select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true&facet.field=Post_Content&facet.minCount=1&facet.limit=50

<int name="also">1</int>
<int name="ani">1</int>
<int name="anoth">1</int>
<int name="atleast">1</int>
<int name="base">1</int>
<int name="bcd">1</int>
<int name="becaus">1</int>
<int name="better">1</int>
<int name="bigger">1</int>
<int name="bio">1</int>
<int name="boot">1</int>
<int name="bootabl">1</int>
<int name="bootload">1</int>
<int name="bootscreen">1</int>

I got 50 such elements. @jpountz, thanks for helping limit the results, but why do all fifty of the individual <int> elements hold the value 1? My guess: the number 1 is the count of documents matching my query (which can only be one, since I queried by Id:Guid), and it does not represent the frequency of the words in Post_Content.

To test this, I removed Id:GUID from the query, and the result was:

<int name="content">33</int>
<int name="can">17</int>
<int name="on">16</int>
<int name="so">16</int>
<int name="some">16</int>
<int name="all">15</int>
<int name="i">15</int>
<int name="do">14</int>
<int name="have">14</int>
<int name="my">14</int>

My problem is how to get the term frequency within a document, not the document frequency of many terms. For example, I know for a fact that bootable is a word I used 6 times in Post_Content, so for a set of documents I want sorted pairs like (6, "bootable"), (5, "disc").
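To make the distinction concrete, here is a small Python sketch that computes both numbers outside Solr. The whitespace tokenizer and the sample posts are made up for illustration and do not mirror Solr's analysis chain (no stemming or stop words).

```python
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption, not Solr's analyzer)."""
    return text.lower().split()


def term_frequency(docs):
    """Total number of occurrences of each term across the given documents."""
    counts = Counter()
    for text in docs:
        counts.update(tokenize(text))
    return counts


def document_frequency(docs):
    """Number of documents that contain each term at least once."""
    counts = Counter()
    for text in docs:
        counts.update(set(tokenize(text)))
    return counts


posts = ["bootable disc bootable usb", "usb disc", "bootable bootable"]

print(term_frequency(posts).most_common(1))   # [('bootable', 4)]
print(document_frequency(posts)["bootable"])  # 2
# With a single matching document, every document frequency is 1,
# which is the pattern seen in the facet output for the Id query.
print(set(document_frequency(posts[:1]).values()))  # {1}
```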

Recommended answer

I have come up with a stopgap solution. (For the sake of the examples, I am calling each Solr document a "post".)

Solr has a terms component whose purpose seems to be to expose all the indexed terms of any given field. It is mainly used to implement features like auto-complete and other features that operate at the term level. By default its results are sorted by frequency, so the terms that occur most frequently in the field come first.

What I have done is create a dynamic field called content_ and index each post-set in its own field, based on category. This means there will be hundreds of instances of the dynamic field, each containing one post-set, and I can use the terms component on that field to get the top terms for that post-set.

For example:

content_postSetOne : contains indexed version of a set of posts
content_postSetTwo : contains indexed version of another set of posts
content_postSetThree : contains indexed version of a third set of posts
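A request against one of these fields might be built as in the Python sketch below. The parameters terms, terms.fl, terms.limit and terms.sort belong to Solr's terms component; the host, the core name posts, the /terms handler path, and the flat term/count list in the JSON response are assumptions about a typical setup and should be checked against your own solrconfig.xml and Solr version. The helper names are hypothetical.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=50):
    """Build a terms component request for the top `limit` terms of `field`."""
    params = {
        "terms": "true",
        "terms.fl": field,       # field whose indexed terms are listed
        "terms.limit": limit,    # maximum number of terms to return
        "terms.sort": "count",   # most frequent terms first
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


def pair_up(flat):
    """Turn [term, count, term, count, ...] into [(term, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical host and core name.
print(terms_url("http://localhost:8983/solr/posts", "content_postSetOne"))

# Hypothetical response fragment, assuming the flat list layout.
print(pair_up(["bootabl", 6, "disc", 5]))  # [('bootabl', 6), ('disc', 5)]
```

The resulting list of pairs can be converted with dict() into the weighted-word dictionary that the cloud generator takes.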

This solution is sort of working for me, and you can just as easily create a field per post if needed. I am also interested in knowing the implications of using dynamic fields like this: will this be a problem?

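Here is how this differs from Paige's and jpountz's answers:


  1. Here the term frequency is the count of the word in a document or a set of documents, not the number of documents that contain the term.

  2. I can get the most frequently occurring terms from a single document, and from a set of documents if needed.

  3. I am not using faceting, because it mostly gives the frequency as a number of documents, not as the number of times the word occurs in a document.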
