Where is the crawled data stored when running the Nutch crawler?

Question

I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data, and do some analysis.

I followed the tutorial at https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search text in the future) and ran the crawl using a few URLs as seeds.

Now I cannot find the text/HTML data on my local machine. Where can I find the data, and what is the best way to read it in text format?

  • apache-nutch-1.9
  • solr-4.10.4

Recommended answer

After your crawl finishes, you can use the bin/nutch dump command to dump all the fetched URLs in plain HTML format.

Usage is as follows:

$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
                 [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                          all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                          the raw data
 -segment <segment>       the segment(s) to use

For example, you could run something like:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This creates a new directory at the -outputDir location and dumps all the crawled pages there in HTML format.
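Since the question also asks how to read the data as text, here is a minimal Python sketch of one possible approach. It is not part of Nutch. It assumes the dump directory is crawl/dump (the -outputDir from the example above), that every file under it is HTML readable as UTF-8, and the names html_to_text and dump_to_text are made up for illustration. It uses only the standard library.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def dump_to_text(dump_dir):
    """Map each file under dump_dir to its extracted text (hypothetical helper)."""
    results = {}
    for path in sorted(Path(dump_dir).rglob("*")):
        if path.is_file():
            # Assumption: dumped files are UTF-8; undecodable bytes are replaced.
            html = path.read_text(encoding="utf-8", errors="replace")
            results[str(path)] = html_to_text(html)
    return results


if __name__ == "__main__":
    # "crawl/dump" is the -outputDir used in the example command above.
    for name, text in dump_to_text("crawl/dump").items():
        print(name, len(text.split()), "words")
```

If the crawl also fetched non-HTML content such as PDFs or images, restricting the dump with -mimetype first is probably simpler than filtering in Python.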

There are many more ways of dumping specific data from Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions
