Apache Nutch: Manipulating the DOM before parsing

Views: 215 | Published: 2020/11/27 20:27:23 | Tags: java, search, indexing, nutch


Question

I want to remove specific elements from the page response before it is handed to Nutch. Specifically, I want to mark parts of my pages with, for example:

 <div class="noindex">I shall not be indexed</div>

I want these elements removed before Nutch parses the page, so that "I shall not be indexed" does not appear in the NutchDocument afterwards. I plan to surround my navigation, header, and footer content with this marker, because right now they are present in every document in the index.

Thanks, Paul
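To make the intended transformation concrete, here is a minimal sketch of the DOM pruning step itself, written against the JDK's `org.w3c.dom` API only. It is not Nutch's API and not a confirmed solution: the class name `NoIndexStripper` and its helper methods are hypothetical, and hooking the `strip` call into Nutch (for example from a custom parse plugin, before the text is extracted) is assumed and not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Nutch, shown only to illustrate the idea.
public class NoIndexStripper {

    // Marker class name taken from the example in the question.
    static final String MARKER = "noindex";

    // Recursively removes every element whose class attribute contains MARKER.
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling first, because removal detaches the child.
            Node next = child.getNextSibling();
            if (child instanceof Element && hasMarker((Element) child)) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    // True if MARKER is one of the whitespace-separated class tokens.
    static boolean hasMarker(Element element) {
        for (String token : element.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(MARKER)) {
                return true;
            }
        }
        return false;
    }

    // Parses a well-formed XHTML snippet; used here only for the demo.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<body><div class=\"noindex\">I shall not be indexed</div>"
                + "<p>Keep me</p></body>");
        strip(doc);
        // Only the text outside the marked element remains.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The sketch matches whole class tokens, so an element with `class="nav noindex"` is removed while one with `class="noindexed"` is kept. Real pages are rarely well-formed XML, so the demo parser would not be suitable for crawled HTML; only the `strip` traversal is meant to carry over to whatever DOM the parser produces.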
