Apache Nutch: Manipulating the DOM before parsing

Question

I want to remove specific elements from the page response before it is handed to Nutch. Specifically, I want to mark parts of my pages with, for example:

 <div class="noindex">I shall not be indexed</div>

and have them removed before Nutch parses the page, so that "I shall not be indexed" is not present in the NutchDocument afterwards. I plan to surround my navigation, header, and footer content with this markup, because right now they are present in every document in the index.

Thanks, Paul

Recommended answer

You have some alternatives for doing that:

Using a content extractor: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/ covers some algorithms. The best way of doing that may also be in a Nutch plugin.
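As a rough illustration of the DOM manipulation the question asks about, here is a minimal sketch that removes every element carrying a given class (such as "noindex") from a W3C DOM tree using only the JDK. The class name NoIndexStripper and the method stripByClass are hypothetical names chosen for this example; they are not part of Nutch. How such a step would be wired into Nutch (for instance inside a plugin) is not shown here and would depend on the Nutch version. The demo also assumes well-formed XHTML input; real-world HTML would need a lenient HTML parser first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: strips marked elements from a DOM tree.
public class NoIndexStripper {

    /** Removes every element below root whose class attribute contains cssClass. */
    public static void stripByClass(Node root, String cssClass) {
        // Collect first, then remove, so the tree is not modified while walking it.
        List<Node> doomed = new ArrayList<>();
        collect(root, cssClass, doomed);
        for (Node n : doomed) {
            Node parent = n.getParentNode();
            if (parent != null) {
                parent.removeChild(n);
            }
        }
    }

    private static void collect(Node node, String cssClass, List<Node> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && hasClass((Element) node, cssClass)) {
            out.add(node);
            return; // the whole subtree is dropped, no need to descend
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), cssClass, out);
        }
    }

    private static boolean hasClass(Element el, String cssClass) {
        // getAttribute returns "" when the attribute is absent.
        for (String token : el.getAttribute("class").trim().split("\\s+")) {
            if (token.equals(cssClass)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the input is well-formed XHTML so the JDK XML parser accepts it.
        String xhtml = "<html><body>"
                + "<div class=\"noindex\">I shall not be indexed</div>"
                + "<p>Keep me</p>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        stripByClass(doc.getDocumentElement(), "noindex");
        System.out.println(doc.getDocumentElement().getTextContent()); // prints "Keep me"
    }
}
```

Matching on whole class tokens rather than a substring means an element with class="noindex footer" is removed, while one with class="noindexed" is left alone.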
