nutch 1.16 skips file:/directory styled links in file system crawl


Problem description

I am trying to run nutch as a crawler over some local directories, using examples taken from both the main tutorial (https://cwiki.apache.org/confluence/display/nutch/FAQ#FAQ-HowdoIindexmylocalfilesystem?) as well as from other sources. Nutch is perfectly able to crawl the web, no problem, but for some reason it refuses to scan local directories.

My configuration files are as follows:

regex-urlfilter:

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# This change is not necessary but may make your life easier.  
# Any file types you do not want to index need to be added to the list otherwise 
# Nutch will often try to parse them and fail in doing so as it doesn't know
# how to deal with a lot of binary file types.
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$

# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
#   can be leaked by placing links pointing to web interfaces of services
#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
#   or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
#     http://localhost:8080
#     http://127.0.0.1/ .. http://127.255.255.255/
#     http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
#     10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
#     192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
#     172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)

# accept anything else
+.
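
A quick way to sanity-check how these first-match rules treat a given URL, independently of Nutch, is to replay them with plain java.util.regex. This is only an illustrative sketch of the contract described in the comments above (the first matching pattern decides; if nothing matches, the URL is ignored), not the urlfilter-regex plugin itself:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RegexFilterSketch {

    // Active rules copied from the regex-urlfilter.txt above:
    // '+' accepts, '-' rejects, first matching pattern decides.
    private static final List<String> RULES = Arrays.asList(
        "-^(http|ftp|mailto):",
        "-(?i)\\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$",
        "-[?*!@=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "+."
    );

    static boolean accepts(String url) {
        for (String rule : RULES) {
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false; // no pattern matched -> URL is ignored
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/"));                                    // false
        System.out.println(accepts("file:/cygdrive/c/Users/abc/Desktop/adirectory/"));         // true
        System.out.println(accepts("file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/")); // true
    }
}

Under these rules all of the file: seed forms (and even a bare path) come back accepted, so the rejections reported later in the log presumably come from somewhere else in the pipeline, such as URL parsing or one of the other filters enabled in plugin.includes.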

nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>http.agent.name</name>
 <value>NutchSpiderTest</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchSpiderTest,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>I am just testing nutch, please tell me if it's bothering your website</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
 <name>file.content.limit</name>
 <value>-1</value>
 <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>By default the crawler is not restricted to the directories specified in
    the URLs file; it also ascends into the parent directories. Setting this to false
    restricts the crawl to the directories beneath the ones you specify.</description>
</property>

</configuration>

And finally, I commented out this part of regex-normalize.xml:

<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->
<!-- we do not need this with files
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
-->
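
For context, here is what that rule would do if left in place. This is just plain java.util.regex applied to the pattern quoted above (the urlnormalizer-regex plugin itself is not involved), so treat it as an illustrative sketch:

import java.util.regex.Pattern;

public class SlashRuleSketch {
    public static void main(String[] args) {
        // Pattern from the rule commented out above: collapse runs of slashes,
        // except directly after ':' (so "http://" keeps its two slashes).
        Pattern dupSlashes = Pattern.compile("(?<!:)/{2,}");

        String[] urls = {
            "http://example.com//a//b.html",
            "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
        };
        for (String url : urls) {
            System.out.println(url + " -> " + dupSlashes.matcher(url).replaceAll("/"));
        }
        // http://example.com//a//b.html -> http://example.com/a/b.html
        // file:///cygdrive/... -> file://cygdrive/... (one slash is collapsed, leaving
        // the form in which "cygdrive" looks like a host name)
    }
}

Which is presumably why it was commented out here; the answer below also mentions a dedicated rule in the urlnormalizer-regex configuration for fixing the number of slashes after file:.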

Running Nutch on Cygwin under Windows 10, using a distribution built with ant, from the runtime/local directory, with the following command:

bin/crawl -s dirs dircrawl 2 >& dircrawl.log

With dirs being the folder containing the following seed.txt file (I tried to include different versions of the links, since it does not seem consistent which version should work, though I could chalk that up to not having found a definitive answer):

/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
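
For reference, plain java.net.URL already parses these forms quite differently. The sketch below is purely illustrative (plain JDK, nothing Nutch-specific), but the MalformedURLException it prints for the bare path is the same one that shows up in the injector log further down:

import java.net.MalformedURLException;
import java.net.URL;

public class SeedUrlParseSketch {
    public static void main(String[] args) {
        String[] seeds = {
            "/cygdrive/c/Users/abc/Desktop/adirectory/",
            "file:/cygdrive/c/Users/abc/Desktop/adirectory/",
            "file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file://cygdrive/c/Users/abc/Desktop/anotherdirectory/",
            "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
        };
        for (String seed : seeds) {
            try {
                URL u = new URL(seed);
                System.out.printf("host='%s' path='%s'  <-  %s%n", u.getHost(), u.getPath(), seed);
            } catch (MalformedURLException e) {
                System.out.println("rejected: " + e.getMessage()); // "no protocol: /cygdrive/..."
            }
        }
        // Note how "file://cygdrive/..." ends up with host='cygdrive' and path='/c/Users/...',
        // while the file:/ and file:/// forms keep the whole /cygdrive/... path and an empty host.
    }
}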

Here dircrawl is the directory I want to save the crawl to, and the number of rounds/max depth is set to '2'. After a few seconds, nutch crawl outputs the following hadoop.txt log file:

2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: starting at 2020-03-24 14:08:58
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: crawlDb: dircrawl/crawldb
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: urlDir: dirs
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2020-03-24 14:08:58,948 INFO  crawl.Injector - Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
2020-03-24 14:08:59,011 WARN  impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:08:59,888 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:08:59,890 INFO  mapreduce.Job - Running job: job_local1269520609_0001
2020-03-24 14:09:00,897 WARN  crawl.Injector - Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
2020-03-24 14:09:00,902 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2020-03-24 14:09:00,906 INFO  mapreduce.Job - Job job_local1269520609_0001 running in uber mode : false
2020-03-24 14:09:00,908 INFO  mapreduce.Job -  map 0% reduce 0%
2020-03-24 14:09:01,158 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:01,447 WARN  zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:01,461 INFO  crawl.Injector - Injector: overwrite: false
2020-03-24 14:09:01,461 INFO  crawl.Injector - Injector: update: false
2020-03-24 14:09:01,924 INFO  mapreduce.Job -  map 100% reduce 100%
2020-03-24 14:09:01,926 INFO  mapreduce.Job - Job job_local1269520609_0001 completed successfully
2020-03-24 14:09:01,951 INFO  mapreduce.Job - Counters: 31
    File System Counters
        FILE: Number of bytes read=1857050
        FILE: Number of bytes written=3067581
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=5
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=6
        Input split bytes=289
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=6
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=13
        Total committed heap usage (bytes)=402653184
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    injector
        urls_filtered=5
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=239
2020-03-24 14:09:02,022 INFO  crawl.Injector - Injector: Total urls rejected by filters: 5
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total new urls injected: 0
2020-03-24 14:09:02,054 INFO  crawl.Injector - Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: starting at 2020-03-24 14:09:08
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: filtering: false
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: normalizing: true
2020-03-24 14:09:08,715 INFO  crawl.Generator - Generator: topN: 50000
2020-03-24 14:09:08,879 WARN  impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:09:10,418 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:09:10,424 INFO  mapreduce.Job - Running job: job_local828841059_0001
2020-03-24 14:09:11,450 INFO  mapreduce.Job - Job job_local828841059_0001 running in uber mode : false
2020-03-24 14:09:11,453 INFO  mapreduce.Job -  map 0% reduce 0%
2020-03-24 14:09:11,784 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2020-03-24 14:09:11,784 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2020-03-24 14:09:11,784 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2020-03-24 14:09:11,816 WARN  zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:12,073 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:12,475 INFO  mapreduce.Job -  map 100% reduce 100%
2020-03-24 14:09:12,505 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:13,485 INFO  mapreduce.Job - Job job_local828841059_0001 completed successfully
2020-03-24 14:09:13,502 INFO  mapreduce.Job - Counters: 30
    File System Counters
        FILE: Number of bytes read=2784859
        FILE: Number of bytes written=4605489
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=0
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=28
        Input split bytes=156
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=28
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=15
        Total committed heap usage (bytes)=603979776
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=98
    File Output Format Counters 
        Bytes Written=16
2020-03-24 14:09:13,502 INFO  crawl.Generator - Generator: number of items rejected during selection:
2020-03-24 14:09:13,521 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...

While the dircrawl.log file yields:

Injecting seed URLs
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch inject dircrawl/crawldb dirs
Injector: starting at 2020-03-24 14:08:58
Injector: crawlDb: dircrawl/crawldb
Injector: urlDir: dirs
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 5
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
24 Mar 2020 14:09:02 : Iteration 1 of 2
Generating a new segment
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true dircrawl/crawldb dircrawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2020-03-24 14:09:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: number of items rejected during selection:
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

So basically now I am kind of stuck. I've tried to undo some of my changes, but no matter what I do I cannot seem to make the configuration work with local directories. Does anyone know what I'm doing wrong?

Recommended answer

NUTCH-1483:

  • These seed URLs should work:

    file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
    file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
    file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/

  • This one does not, because "cygdrive" is taken as the host name:

    file://cygdrive/c/Users/abc/Desktop/anotherdirectory/

    I can confirm that crawling file systems works using Nutch 1.16 on Linux (no Windows at hand). Notes:

    • urlfilter-validator is meant for internet URLs only, because it requires the host name to contain a dot
    • the configuration file of urlnormalizer-regex contains a special rule to fix the number of slashes after file:
    • there is also a tool "normalizerchecker"
    • you might also try "parsechecker" to quickly verify which form of file: URL definitely works given your configuration:

    $> bin/nutch parsechecker file://var/www/html/
    fetching: file://var/www/html/
    Fetch failed with protocol status: notfound(14), lastModified=0
    
    $> bin/nutch parsechecker file:///var/www/html/
    fetching: file:///var/www/html/
    parsing: file:///var/www/html/
    ...
    Status: success(1,0)
    Title: Index of /mnt/data/var_www_html
    Outlinks: 2
      outlink: toUrl: file:/mnt/data/ anchor: ../
      outlink: toUrl: file:/mnt/data/var_www_html/index.html anchor: index.html
    ...
    

    • You should also check all Nutch properties that have the "file." prefix; one way to enumerate them is sketched below.
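
    A minimal sketch of one way to list those properties, assuming the Nutch runtime jars and its conf/ directory are on the classpath; NutchConfiguration.create() loads nutch-default.xml plus the nutch-site.xml overrides, so it prints the effective values (alternatively, just search conf/nutch-default.xml for "file."):

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ListFileProperties {
        public static void main(String[] args) {
            // Loads nutch-default.xml and nutch-site.xml from the classpath.
            Configuration conf = NutchConfiguration.create();
            for (Map.Entry<String, String> entry : conf) {
                if (entry.getKey().startsWith("file.")) {
                    System.out.println(entry.getKey() + " = " + entry.getValue());
                }
            }
        }
    }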
