How to parse HTML with Nutch and index a specific tag to Solr?
Problem description
I have installed Nutch and Solr to crawl a website and search it. The Nutch parse-metatags plugin can index the meta tags of web pages into Solr (http://wiki.apache.org/nutch/IndexMetatags). Is there a way, through a plugin or otherwise, to index the content of an HTML tag that is not a meta tag? For example:
<div id=something>
me specific tag
</div>
Specifically, I want to add a field (something) to Solr whose value for this page is "me specific tag".
Any ideas?
I wrote my own plugin for something similar. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. You still have to populate those fields somewhere, though.
Here are some tips for the plugin:
- Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a plugin in simple steps.
- In your plugin, implement the ParseFilter and IndexingFilter extension points.
- In YourParseFilter, you can use NodeWalker to find your specific div.
Put the parsed information into the page metadata like this:
page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:
doc.add("your_specific_tag", value);
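The extraction and the metadata round trip can be sketched outside Nutch with nothing but the JDK. This is only an illustration under stated assumptions, not the plugin itself: the class DivTextSketch and its methods are hypothetical names, a plain recursive DOM walk stands in for Nutch's NodeWalker, and the sample page is well-formed XHTML with a quoted id so the JDK XML parser accepts it. In a real ParseFilter, Nutch hands you the already-parsed DOM instead.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; none of these names come from Nutch.
final class DivTextSketch {

    // Depth-first search for the first <div> whose id attribute equals divId.
    // This plain recursion is a stand-in for walking the tree with NodeWalker.
    static Element findDivById(Node node, String divId) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName())
                    && divId.equals(el.getAttribute("id"))) {
                return el;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element found = findDivById(child, divId);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // "Parse filter" side: returns the trimmed text of the div, or null if absent.
    // Assumes the input is well-formed XHTML.
    static String extractDivText(String xhtml, String divId) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        Element div = findDivById(doc.getDocumentElement(), divId);
        return div == null ? null : div.getTextContent().trim();
    }

    // Encode the text as UTF-8 bytes, the form the metadata call above stores.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // "Indexing filter" side: decode the stored bytes back into a field value.
    // duplicate() leaves the original buffer's position untouched.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"something\">\nme specific tag\n</div></body></html>";
        ByteBuffer stored = toMetadataValue(extractDivText(page, "something"));
        System.out.println(fromMetadataValue(stored));
    }
}
```

In the plugin, the string returned by extractDivText would be what you wrap into the page metadata, and the string returned by fromMetadataValue would be the value you pass to doc.add.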
Most important: add your_specific_tag to the fields in these files:
- Solr config file schema.xml (and restart Solr)
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
- Nutch config file schema.xml (I don't know whether this is really necessary)
- Nutch config file solrindex-mapping.xml
<field dest="your_specific_tag" source="your_specific_tag"/>