How to Crawl .pdf links using Apache Nutch

Views: 313 | Published: 2016/5/21 13:39:28 | Tags: apache, hadoop, nutch


Question

I have a website to crawl that includes some links to PDF files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6, and I am trying this in Java as follows:

ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));

Can someone help me with this?

Recommended answer

If you want Nutch to crawl and index your PDF documents, you have to enable document crawling and the Tika plugin:

Document crawling

1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf"

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"

1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes property

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>
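
The two regular expressions used in the steps above can be checked outside Nutch before running a crawl. The following is a minimal sketch using only java.util.regex, not Nutch's own filter or plugin classes. The class name NutchRegexCheck and its methods are hypothetical, and it assumes that a "-" rule in regex-urlfilter.txt rejects a URL when the pattern is found anywhere in it, and that a plugin is loaded when its id matches the whole plugin.includes expression.

```java
import java.util.regex.Pattern;

class NutchRegexCheck {
    // Suffix rule from regex-urlfilter.txt as shown above (no "pdf" entry),
    // written without the leading "-" that marks it as a reject rule.
    static final Pattern SKIP_SUFFIXES = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT"
        + "|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$");

    // plugin.includes value from nutch-site.xml as shown above.
    static final Pattern PLUGIN_INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)");

    // Assumption: a reject rule applies when the pattern is found anywhere in the URL.
    static boolean isSkipped(String url) {
        return SKIP_SUFFIXES.matcher(url).find();
    }

    // Assumption: a plugin is included when its id matches the whole expression.
    static boolean isIncluded(String pluginId) {
        return PLUGIN_INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println("report.pdf skipped: " + isSkipped("http://example.com/report.pdf"));
        System.out.println("logo.png skipped:   " + isSkipped("http://example.com/logo.png"));
        System.out.println("parse-tika loaded:  " + isIncluded("parse-tika"));
    }
}
```

Under these assumptions, a URL ending in .pdf is no longer rejected by the suffix rule, while image URLs still are, and parse-tika is among the plugins the expression selects.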

If what you really want is to download all the PDF files from a page, you can use something like Teleport on Windows or Wget on *nix.
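
For readers who would rather stay in Java for that simpler task, the first half of it (finding the PDF links on a page) might look roughly like the sketch below. It is not what Teleport or Wget do internally. The class PdfLinkExtractor and the method pdfLinks are hypothetical names, the regular expression is deliberately naive (it only sees quoted href values ending in .pdf), and the actual download step is left out.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfLinkExtractor {
    // Naive pattern: quoted href values that end in ".pdf" (case-insensitive).
    // A real HTML parser would be more robust; this is only a sketch.
    private static final Pattern PDF_HREF = Pattern.compile(
        "href\\s*=\\s*[\"']([^\"']+\\.pdf)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every PDF link in the HTML, resolved against the page's own URL.
    static List<URI> pdfLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = PDF_HREF.matcher(html);
        while (m.find()) {
            links.add(base.resolve(m.group(1)));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"docs/a.pdf\">A</a> <a href='/b.PDF'>B</a> <a href=\"c.html\">C</a>";
        URI page = URI.create("http://example.com/site/index.html");
        for (URI link : pdfLinks(html, page)) {
            System.out.println(link);
        }
    }
}
```

Each resulting URI could then be fetched with any HTTP client. For anything beyond a single page, a dedicated tool such as Wget is likely the easier route.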
