Apache Nutch to index only part of page content

Views: 40 Published: 2021/6/11 18:42:11 Tags: solr, nutch


Problem description

I am going to use Apache Nutch v1.3 to extract only some specific content from web pages. I checked the parse-html plugin. It seems to normalize each HTML page using TagSoup or NekoHTML, which is good. I need to extract only the text inside <span class='xxx'> and <span class='yyy'> elements on the web page. It would be great if the extracted texts were saved into different fields (e.g. content_xxx, content_yyy). My question is: should I write my own plugin, or can this be done in some standard way?

The best way would be to apply XSLT to the normalized web page and get the result. Is that possible?
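
To make the requested behavior concrete, here is a minimal standalone sketch of the extraction itself: it collects the text inside span elements with class xxx or yyy into separate fields named content_xxx and content_yyy, as in the question. It is an illustration only. It does not use any Nutch API, it uses Python's standard html.parser rather than XSLT, and the names FIELDS, SpanTextExtractor, and extract_fields are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class to output field name
# (class and field names are the placeholders used in the question).
FIELDS = {"xxx": "content_xxx", "yyy": "content_yyy"}


class SpanTextExtractor(HTMLParser):
    """Collects text found inside <span> elements whose class is in FIELDS."""

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in FIELDS.values()}
        # One entry per currently open <span>: a field name, or None
        # when the span's class is not one we are interested in.
        self._open_spans = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((FIELDS[c] for c in classes if c in FIELDS), None)
        self._open_spans.append(field)

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            self._open_spans.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost enclosing span that matched.
        for field in reversed(self._open_spans):
            if field is not None:
                self.fields[field].append(data)
                break


def extract_fields(html):
    """Return a dict of field name -> whitespace-normalized text."""
    parser = SpanTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        name: " ".join(" ".join(parts).split())
        for name, parts in parser.fields.items()
    }


if __name__ == "__main__":
    page = (
        "<html><body><p>skip</p>"
        "<span class='xxx'>Title text</span>"
        "<div><span class='yyy'>Body text</span></div>"
        "</body></html>"
    )
    print(extract_fields(page))
```

Text fragments are joined with a single space, so text split by inline tags within one word would gain a space; that is a simplification of this sketch. Inside Nutch, equivalent logic would need to run on the parsed document and write each value to its own index field, which this sketch does not attempt.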
