Removing menus from HTML during crawl or indexing with Nutch and Solr


Problem description

I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, several menu structures across the site get indexed and spoil the results of a query.

Each of these menus is clearly defined in a DIV, for example <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, along with several others.

I need to, at some point, delete the content in these DIVs.

I am guessing that the right place is during indexing by Solr, but I cannot work out how.

A pattern would look something like (<div id="calendar">).*?(<\/div>), but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(<\/div>)" />, and I am not really sure where to put it in schema.xml.

When I do put that pattern in schema.xml, the file does not parse.

Solution

Here is a patch for SOLR that you can place in your indexing config to ignore the contents of tags you configure. It will only work with XML, though, so if you can tidy your HTML or you know that it is XHTML, then this would work, but it won't work with just any random HTML.
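
Separately from the patch, the parse failure described in the question is probably caused by the unescaped double quotes and `<` characters inside the `pattern` attribute, which make schema.xml invalid XML. The fragment below is a minimal sketch of an escaped version, not a tested configuration. It assumes the raw HTML markup actually reaches the Solr field (Nutch may already have stripped tags during parsing, in which case the pattern will never match), and the field type name `text_nomenu` is hypothetical. It uses `solr.PatternReplaceCharFilterFactory` to drop the matched text before tokenizing, rather than the `PatternTokenizerFactory` from the question.

```xml
<!-- Hypothetical field type name; place inside the types section of schema.xml. -->
<fieldType name="text_nomenu" class="solr.TextField">
  <analyzer>
    <!-- (?s) lets "." match newlines. Quotes and "<" are escaped so the
         attribute value is well-formed XML. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;div id=&quot;calendar&quot;&gt;.*?&lt;/div&gt;"
                replacement=""/>
    <!-- Remove any remaining markup before tokenizing. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Two limits of this sketch are worth keeping in mind. The non-greedy `.*?` stops at the first `</div>`, so a menu that contains nested DIVs is only partly removed. Also, a char filter changes only the indexed tokens; the stored field value still contains the menu text.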
