How to parse HTML with Nutch and index a specific tag to Solr?
Question
I have installed Nutch and Solr to crawl a website and search it. As you know, the parse-metatags plugin of Nutch can index the meta tags of web pages into Solr (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr, like this:
<div id=something>
me specific tag
</div>
In other words, I want to add a field to Solr (something) whose value is "me specific tag" for this page.
Any ideas?
Recommended answer
I made my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml. You can add your own tags there, but you still have to fill them in somewhere.
Here are some tips for writing the plugin:
- Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to write a simple plugin.
- In your plugin, extend ParseFilter and IndexingFilter.
- In YourParseFilter, you can use NodeWalker to find your specific div.
- Put the information you parsed into the page metadata like this:
page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
- In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:
doc.add("your_specific_tag", value);
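The parse and index steps above can be sketched with nothing but the JDK. This is a hedged illustration, not the plugin itself: the class `DivToFieldSketch` and its methods are hypothetical names, a plain `Map<String, ByteBuffer>` stands in for the page metadata, and the JDK XML parser stands in for the DOM that Nutch hands to a parse filter (so the sample input has to be well-formed). In a real plugin the same logic would live inside YourParseFilter and YourIndexingFilter.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the two filters; uses no Nutch classes.
class DivToFieldSketch {

    // Depth-first walk: returns the first <div> whose id equals divId, or null.
    static Element findDivById(Node node, String divId) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName())
                    && divId.equals(el.getAttribute("id"))) {
                return el;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Element hit = findDivById(c, divId);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Parse-filter side: extract the div text and store it as UTF-8 bytes.
    // Nothing is stored when the div is absent.
    static void storeDivText(String xhtml, String divId, String key,
                             Map<String, ByteBuffer> metadata) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Element div = findDivById(dom.getDocumentElement(), divId);
            if (div != null) {
                byte[] bytes = div.getTextContent().trim()
                        .getBytes(StandardCharsets.UTF_8);
                metadata.put(key, ByteBuffer.wrap(bytes));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Indexing-filter side: decode the stored bytes back into a String.
    // duplicate() leaves the stored buffer's position untouched.
    static String readValue(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer raw = metadata.get(key);
        return raw == null
                ? null
                : StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\n"
                + "me specific tag\n</div></body></html>";
        Map<String, ByteBuffer> metadata = new HashMap<>();
        storeDivText(page, "something", "yourKEY", metadata);
        // This decoded value is what would be passed to doc.add("your_specific_tag", value).
        System.out.println(readValue(metadata, "yourKEY"));
    }
}
```

The point of the round trip is that the metadata holds raw bytes, so the indexing side has to decode them with the same charset the parse side used before adding the value to the document.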
The most important part!
Add your specific tag to the following files:
- Solr config file schema.xml (and restart Solr)
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
- Nutch config file schema.xml (I don't know whether this is really necessary)
- Nutch config file solrindex-mapping.xml
<field dest="your_specific_tag" source="your_specific_tag"/>