How to parse HTML with Nutch and index a specific tag to Solr?

Views: 32  Published: 2021/11/14 23:44:41  Tags: solr nutch apache-tika


Problem description


I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse-metatags plugin (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr. For example:

<div id=something>
      me specific tag
</div>

That is, I want to add a field to Solr (something) that holds the value "me specific tag" for this page.

Any ideas?

Solution

I made my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. You still have to fill those fields somewhere, though.

Here are some tips for the plugin:

Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to write a plugin quite simply.
In your plugin, implement ParseFilter and IndexingFilter.
In YourParseFilter you can use NodeWalker to find your specific div.
Put the information you parsed into the page metadata like this:

page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

doc.add("your_specific_tag", value);
Most important:
Add your_specific_tag to the fields of:
- Solr config file schema.xml (and restart Solr)
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
- Nutch config file schema.xml (I am not sure whether this is really necessary)
- Nutch config file solrindex-mapping.xml
<field dest="your_specific_tag" source="your_specific_tag"/>
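
To make the parse and metadata steps more concrete, here is a minimal sketch in plain Java. It deliberately does not use any Nutch classes: the JDK's DOM parser stands in for the DOM that a parse filter receives, a hand-written recursive walk stands in for NodeWalker, and a HashMap stands in for the page metadata. The class and method names (SpecificTagSketch, findTextById, toMetadataValue, fromMetadataValue, parseAndExtract) are made up for illustration, and the input is assumed to be well-formed XHTML. Inside a real plugin the same logic would live in your ParseFilter and IndexingFilter.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical stand-in for the plugin logic; no Nutch APIs are used here.
class SpecificTagSketch {

    // Depth-first walk: returns the trimmed text of the first element
    // whose id attribute equals the given id, or null if none is found.
    static String findTextById(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && id.equals(((Element) node).getAttribute("id"))) {
            return node.getTextContent().trim();
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findTextById(child, id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Parse step: pack the extracted text into a ByteBuffer, the value type
    // that page.putToMetadata expects in the answer above.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Indexing step: decode the stored bytes back into a String.
    // duplicate() leaves the position of the stored buffer untouched.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    // Convenience wrapper: parse well-formed XHTML and extract the text by id.
    static String parseAndExtract(String xhtml, String id) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return findTextById(dom, id);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\n      me specific tag\n</div></body></html>";

        // A plain map stands in for the page metadata.
        Map<String, ByteBuffer> metadata = new HashMap<>();
        String text = parseAndExtract(page, "something");
        if (text != null) {
            metadata.put("yourKEY", toMetadataValue(text));
        }

        // In the indexing filter this value would go to doc.add("your_specific_tag", value).
        System.out.println(fromMetadataValue(metadata.get("yourKEY")));
    }
}
```

Storing the value as UTF-8 bytes and decoding it with the same charset on the indexing side keeps the round trip lossless for non-ASCII text as well.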
