Removing menus from HTML during crawl or indexing with Nutch and Solr
Problem description
I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, there are several menu structures across the site that get indexed and spoil the results of a query.
Each of these menus is clearly defined in a DIV so <div id="RHBOX"> ... </div> or <div id="calendar"> ...</div>
and several others.
I need to, at some point, delete the content in these DIVs.
I am guessing that the right place is during indexing by Solr, but I cannot work out how.
A pattern would look something like (<div id="calendar">).*?(<\/div>)
but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />
and I am not really sure where to put it in schema.xml.
When I do put that pattern in schema.xml, the file does not parse.
Here is a patch for SOLR that you can place in your indexing config to ignore the contents of tags you configure. It will only work with XML, though, so if you can tidy your HTML or you know that it is XHTML, then this would work, but it won't work with just any random HTML.
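On the parse error itself: the pattern in the question contains raw `"` and `<` characters inside an XML attribute value, which makes schema.xml malformed. Inside an attribute they have to be written as `&quot;` and `&lt;`. The fragment below is a rough sketch of one possible alternative, not the patch mentioned above. It uses a char filter, which rewrites the text before tokenizing, rather than a tokenizer. It assumes the field still contains raw HTML when it reaches Solr (Nutch may already have stripped the markup by then), and the field type name `text_nomenu` is made up for illustration.

```xml
<!-- Sketch only: "text_nomenu" is a hypothetical name, and this assumes
     the field value still contains raw HTML markup at index time. -->
<fieldType name="text_nomenu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Drop the calendar DIV and everything inside it before tokenizing.
         (?s) lets . match newlines; &lt; &gt; and &quot; are XML escapes. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;div id=&quot;calendar&quot;&gt;.*?&lt;/div&gt;"
                replacement=" "/>
    <!-- Strip whatever HTML tags remain. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two caveats apply. The non-greedy `.*?` stops at the first `</div>`, so a menu DIV that contains nested DIVs would only be partly removed. Char filters also affect only the indexed tokens, so the stored field value would still include the menu text.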