How do I save the original HTML files with Apache Nutch?


Question

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I only get binary database files. How do I get the original HTML files with Nutch?

Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are preferred.)

Recommended answer

Nutch writes the crawled data in binary form, so if you want it saved as HTML you will have to modify the code. (This will be painful if you are new to Nutch.)

If you want a quick and easy way to get the HTML pages:

  1. If the list of pages/URLs you need is small, it is better to use a script that invokes wget for each URL.
  2. Or use the HTTrack tool.
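
As a rough illustration of the first option, here is a minimal Java sketch that runs wget once per URL. The class name `WgetBatch`, the `pages` output directory, and the choice to pass URLs as command-line arguments are assumptions made for this example; it also assumes wget is installed and on the PATH. A plain shell loop would do the same job.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration: invokes wget once per URL. */
public class WgetBatch {

    /** Builds the wget command line that saves one URL under outputDir (-P sets the directory prefix). */
    public static List<String> commandFor(String url, String outputDir) {
        return Arrays.asList("wget", "-P", outputDir, url);
    }

    /** Runs wget for a single URL and returns its exit code (0 means success). */
    public static int download(String url, String outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandFor(url, outputDir))
                .inheritIO() // show wget's progress output in this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Each command-line argument is treated as one URL to fetch.
        for (String url : args) {
            int exitCode = download(url, "pages");
            if (exitCode != 0) {
                System.err.println("wget failed for " + url + " (exit code " + exitCode + ")");
            }
        }
    }
}
```

Running `java WgetBatch http://example.com/` would then leave the downloaded page in the `pages` directory, provided wget itself succeeds.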

Writing your own Nutch plugin would be great: it solves your problem, and you can contribute to Nutch by submitting your work. If you are new to Nutch (in terms of code and design), you will have to invest a lot of time in building a new plugin; otherwise it is easy to do.

A few pointers to help you get started:

Here is a page that talks about writing your own Nutch plugin.

Start with Fetcher.java. See lines 647-648. That is the place where you can get the fetched content on a per-URL basis (for pages that were fetched successfully).

pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
updateStatus(content.getContent().length);

You should add code right after this to invoke your plugin and pass the content object to it. As you may have guessed, content.getContent() is the content of the URL you want. Inside the plugin code, write it to a file. The file name should be based on the URL, otherwise the files will be difficult to work with. The URL is available as fit.url.
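
The file-writing part of such a plugin could look roughly like the sketch below. `RawPageWriter`, its method names, and the percent-encoding scheme for file names are hypothetical choices for this example, not part of Nutch. It assumes Java 10 or later, that the page bytes come from content.getContent(), and that fit.url can be turned into a string with toString().

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not a Nutch class): saves fetched bytes under a URL-derived file name. */
public class RawPageWriter {

    /** Percent-encodes the URL so it contains no path separators, then appends ".html". */
    public static String fileNameFor(String url) {
        return URLEncoder.encode(url, StandardCharsets.UTF_8) + ".html";
    }

    /** Writes the raw page bytes into outputDir and returns the path of the written file. */
    public static Path save(Path outputDir, String url, byte[] content) throws IOException {
        Files.createDirectories(outputDir); // no-op if the directory already exists
        Path target = outputDir.resolve(fileNameFor(url));
        return Files.write(target, content); // creates the file or overwrites an older copy
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; inside Fetcher.java the bytes would come from content.getContent().
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        Path written = save(Files.createTempDirectory("rawpages"), "http://example.com/a?b=1", page);
        System.out.println("Saved to " + written);
    }
}
```

Percent-encoding keeps the mapping from URL to file name reversible, but very long URLs can exceed the file system's name length limit, so a hash-based name may be a better fit for a large crawl.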
