Nutch - Getting Error: JAVA_HOME is not set. when trying to crawl


Problem description

First and foremost, I'm a Nutch/Hadoop newbie. I have installed Cassandra, and I have installed Nutch on the master node of my EMR cluster. When I attempt to execute a crawl using the following command:

sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5

I get

Error: JAVA_HOME is not set.

If I run the command without 'sudo' I get:

Injector: starting at 2014-07-16 02:12:24
Injector: crawlDb: urls/crawldb
Injector: urlDir: crawl
Injector: Converting injected urls to crawl db entries.
Injector: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hadoop/apache-nutch-1.8/runtime/local/crawl
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:279)
    at org.apache.nutch.crawl.Injector.run(Injector.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:306)

I cannot figure this out. I've seen a similar thread on another forum (Similar Topic) and followed it to no avail. I have added

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

and

export PATH=$PATH:${JAVA_HOME}/bin

to my ~/.bashrc, and I am using Linux.

Any help will be appreciated!!

Solution

The problem was that I was running

sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5

Instead, I used

bin/crawl ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5

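And all is well. It was just a malformed command: the old 'crawl' command with its -dir, -depth and -topN options is deprecated, as stated here: Apache Nutch Tutorial. The Injector output above shows what went wrong: the script took 'crawl' as the seed URL directory (urlDir: crawl) and 'urls' as the crawl directory (crawlDb: urls/crawldb), which is why it reported that the input path .../runtime/local/crawl does not exist.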
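Both failure modes above (JAVA_HOME missing from the environment, and a seed path that does not exist) can be caught before the crawl starts. The following is a minimal sketch of a pre-flight check, not something that ships with Nutch: the function name `preflight` and its exit codes are hypothetical, and it assumes only that the first argument to bin/crawl is the seed path, as in the working command above.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; NOT part of Nutch.
# Usage: preflight <seed-path>
preflight() {
    seed_path="$1"

    # The crawl fails early if JAVA_HOME is not in the environment.
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi

    # The first positional argument must point at existing seed URLs.
    if [ ! -e "$seed_path" ]; then
        echo "seed path does not exist: $seed_path" >&2
        return 2
    fi

    echo "ok: JAVA_HOME=$JAVA_HOME seed=$seed_path"
}

# Example call, left commented out so the sketch has no side effects:
# preflight ./urls/seed.txt && bin/crawl ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5
```

As for the sudo case: sudo resets most environment variables by default, so an export in ~/.bashrc is likely not visible to a command started with sudo, which would explain the "JAVA_HOME is not set" error. If sudo is genuinely required, `sudo -E` (preserve the environment, where the sudoers policy allows it) or `sudo env JAVA_HOME="$JAVA_HOME" bin/crawl ...` are the usual ways to pass the variable through. In this case sudo was not needed at all.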
