How to select data from specific tags in Nutch

Views: 43 · Published: 2021/6/11 18:42:31 · Tags: web-scraping, web-crawler, nutch


Question

I am new to Apache Nutch and would like to know whether it is possible to crawl only a selected area of a web page. For instance, select a particular div and crawl only the contents of that div. Any help would be appreciated. Thanks!

Recommended answer

You will have to write a plugin that implements HtmlParseFilter to achieve your goal.

I reckon you will be doing some of the work yourself: parsing the specific section of the HTML, extracting the URLs you want, and adding them as outlinks.

HtmlParseFilter implementation (the code below gives the general idea):

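import java.nio.charset.StandardCharsets;
import org.apache.nutch.parse.*;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Method of a class that implements org.apache.nutch.parse.HtmlParseFilter
@Override
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
    // Get the raw HTML (assumes the page is UTF-8 encoded)
    String htmlContent = new String(content.getContent(), StandardCharsets.UTF_8);
    // Parse htmlContent with jsoup or any other library and select the section you need
    String url = content.getUrl();
    Parse parse = parseResult.get(url);
    ParseData parseData = parse.getData();
    Outlink[] links = parseData.getOutlinks();
    // Modify/select only the required outlinks
    // Return the ParseResult with the modified outlinks
    return parseResult;
}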
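
The "select a particular div" step can also be done without an extra library by walking the DocumentFragment that the filter receives. The following is only a minimal sketch under assumptions: DivSelector, its method names, and the id "main" are hypothetical and not part of Nutch, and the demo builds its DOM from well-formed XHTML with the JDK's XML parser rather than from Nutch's HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of Nutch): pick one element out of a DOM tree. */
public class DivSelector {

    /** Depth-first search for the first <tag id="..."> element under root, or null. */
    public static Element findById(Node root, String tag, String id) {
        if (root instanceof Element) {
            Element e = (Element) root;
            // Compare case-insensitively: HTML parsers may report tag names in upper case.
            if (e.getTagName().equalsIgnoreCase(tag) && id.equals(e.getAttribute("id"))) {
                return e;
            }
        }
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Element hit = findById(children.item(i), tag, id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Collect the href value of every <a> element inside the given subtree. */
    public static List<String> hrefsIn(Element scope) {
        List<String> hrefs = new ArrayList<>();
        NodeList descendants = scope.getElementsByTagName("*");
        for (int i = 0; i < descendants.getLength(); i++) {
            Element e = (Element) descendants.item(i);
            if (e.getTagName().equalsIgnoreCase("a") && e.hasAttribute("href")) {
                hrefs.add(e.getAttribute("href"));
            }
        }
        return hrefs;
    }

    /** Demo only: build a DOM from well-formed XHTML using the JDK's XML parser. */
    public static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse demo markup", e);
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<html><body><a href='/nav'>nav</a>"
                + "<div id='main'>Hello <a href='/a'>A</a> and <a href='/b'>B</a></div>"
                + "</body></html>");
        Element main = findById(doc, "div", "main");
        System.out.println(main.getTextContent()); // Hello A and B
        System.out.println(hrefsIn(main));         // [/a, /b]
    }
}
```

Inside the filter, the same lookup would presumably be applied to the doc parameter, and the collected hrefs compared against the existing outlinks to decide which ones to keep.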

Hope this helps.

If you are new to plugins, I have written a simple plugin, "nutch-fetch-page", which uses the HtmlParseFilter interface to save HTML pages and their text content on a local drive. You can fork or download it and modify the code.
