How to Crawl .pdf links using Apache Nutch
Problem description
I have a website to crawl that includes some links to PDF files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6, and I am trying this in Java as follows:
ToolRunner.run(NutchConfiguration.create(), new Crawl(),
tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));
Can someone help me with this?
Recommended answer
If you want Nutch to crawl and index your PDF documents, you have to enable document crawling and the Tika plugin:
1. Document crawling
1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf":
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
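To see why the suffix list matters, here is a minimal Java sketch of how an ordered list of +/- regex rules can decide whether a URL is kept. It is a simplified emulation, not Nutch's own filter class: the class name `UrlFilterSketch`, the shortened rule lists, and the example.com URLs are all hypothetical, and the "first matching rule wins, no match means reject" behavior is an assumption about how such a filter is evaluated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified emulation of an ordered +/- regex rule list.
// This is NOT Nutch's own class; all names here are hypothetical.
class UrlFilterSketch {
    // One rule: '+' accepts and '-' rejects when the pattern is found in the URL.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parse lines written in the regex-urlfilter.txt style,
    // skipping blank lines and '#' comments.
    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue;
            }
            char sign = t.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + t);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(t.substring(1))));
        }
    }

    // Assumption: the first rule whose pattern is found decides;
    // a URL that matches no rule is rejected.
    boolean accepts(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical rule lists: one still excludes pdf, one does not.
        UrlFilterSketch withPdf = new UrlFilterSketch(
                Arrays.asList("-\\.(gif|jpg|pdf|PDF)$", "+."));
        UrlFilterSketch withoutPdf = new UrlFilterSketch(
                Arrays.asList("-\\.(gif|jpg)$", "+."));

        String url = "http://example.com/docs/manual.pdf";
        System.out.println("pdf still listed: accepted = " + withPdf.accepts(url));
        System.out.println("pdf removed:      accepted = " + withoutPdf.accepts(url));
    }
}
```

As long as "pdf" appears in the exclusion rule, a link ending in .pdf is dropped before it is ever fetched, which is why the suffix has to be removed from the filter files first.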
1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"
1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes property:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
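A quick way to sanity-check a plugin.includes value is to test plugin directory names against it as a regular expression. The Java sketch below does that with the value shown above. The class name `PluginIncludesSketch` and the helper `isIncluded` are hypothetical, and treating the check as a whole-name match is an assumption based on the property description, not a copy of Nutch's plugin loader.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking plugin names against a plugin.includes value.
class PluginIncludesSketch {
    // Assumption: a plugin counts as included only when its whole
    // directory name matches the expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // The value from the nutch-site.xml snippet above.
        String value = "protocol-http|urlfilter-regex|parse-(html|tika|text)"
                + "|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)";

        String[] ids = {"parse-tika", "parse-html", "parse-pdf", "urlfilter-suffix"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(value, id));
        }
    }
}
```

Under that assumption, parse-tika and parse-html match the value while urlfilter-suffix does not, so the suffix-urlfilter.txt edit would only matter in a configuration that also enables the urlfilter-suffix plugin.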
If what you really want is to download all the PDF files from a page, you can use something like Teleport on Windows or Wget on *nix.