How to parse HTML with Nutch and index a specific tag into Solr?

Views: 102 | Published: 2020/9/4 22:59:38 | Tags: solr, nutch, apache-tika

Question

I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr, like this:

<div id=something>
      me specific tag
</div>

That is, I want to add a field (something) to Solr whose value for this page is "me specific tag".

Any ideas?

Recommended answer

I made my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml. There you can add your own tags. You still have to fill those tags with values somewhere, though.

Here are some tips for writing the plugin:

Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a plugin quite simply.
In your plugin, implement the ParseFilter and IndexingFilter extension points.
In YourParseFilter, you can use NodeWalker to find your specific div.
Put the parsed information into the page metadata, like this:

page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

doc.add("your_specific_tag", value);
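The two snippets above depend on Nutch classes, so here is a rough stand-alone sketch of the same flow that uses only the JDK. It is an illustration, not the plugin itself. It assumes the page is well-formed XHTML so the JDK XML parser can read it (inside Nutch, the parse filter would be handed a DOM from Nutch's own HTML parser and could walk it with NodeWalker). The names DivTextSketch, findDivText, toMetadataValue and fromMetadataValue are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DivTextSketch {

    // Depth-first walk: return the trimmed text of the first <div> with the given id,
    // or null if there is none. Stands in for the NodeWalker loop in a parse filter.
    public static String findDivText(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName()) && id.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findDivText(child, id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Parse-filter side: encode the extracted text the way the metadata value is wrapped.
    public static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Indexing-filter side: decode the stored bytes back into the string to index.
    // duplicate() leaves the original buffer's position untouched.
    public static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    // Assumption: the input is well-formed XHTML (quoted attributes, closed tags).
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"something\">\n      me specific tag\n</div></body></html>";
        String text = findDivText(parse(page), "something");
        ByteBuffer stored = toMetadataValue(text);
        System.out.println(fromMetadataValue(stored));
    }
}
```

In a real plugin, the string returned by something like fromMetadataValue is what you would pass as value to doc.add("your_specific_tag", value).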

Most important!

Add your specific tag to the following files:

Solr config file schema.xml (and restart Solr)

<field name="your_specific_tag" type="string" stored="true" indexed="true"/>

Nutch config file schema.xml (I don't know whether this is really necessary)
Nutch config file solrindex-mapping.xml

<field dest="your_specific_tag" source="your_specific_tag"/>
