Nutch - Getting Error: JAVA_HOME is not set. when trying to crawl
Problem description
First and foremost, I'm a Nutch/Hadoop newbie. I have installed Cassandra, and I have installed Nutch on the master node of my EMR cluster. When I attempt to execute a crawl using the following command:
sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5
I get
Error: JAVA_HOME is not set.
If I run the command without 'sudo' I get:
Injector: starting at 2014-07-16 02:12:24
Injector: crawlDb: urls/crawldb
Injector: urlDir: crawl
Injector: Converting injected urls to crawl db entries.
Injector: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hadoop/apache-nutch-1.8/runtime/local/crawl
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.crawl.Injector.inject(Injector.java:279)
at org.apache.nutch.crawl.Injector.run(Injector.java:316)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:306)
I cannot figure this out. I've seen the other forum here: Similar Topic
and followed it to no avail. I have added
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
and
export PATH=$PATH:${JAVA_HOME}/bin
to my ~/.bashrc, and I am using Linux.
Any help will be appreciated!!
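A likely reason the exports in ~/.bashrc had no effect under sudo: by default, sudo runs the command with a reset environment, so variables exported in the calling shell are usually not passed through. The sketch below imitates that with `env -i` (an empty environment) rather than sudo itself, so it is only an illustration; the actual behaviour depends on the sudoers policy on the machine.

```shell
# JAVA_HOME as exported in ~/.bashrc (path taken from the question).
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Command string that prints JAVA_HOME as seen by a child process.
show_java_home='echo "${JAVA_HOME:-unset}"'

# Child started with an empty environment, similar to sudo's default reset.
cleared=$(env -i /bin/sh -c "$show_java_home")

# Child started with JAVA_HOME passed through explicitly.
passed=$(env -i JAVA_HOME="$JAVA_HOME" /bin/sh -c "$show_java_home")

echo "cleared: $cleared"
echo "passed:  $passed"
```

With sudo, the equivalent would be something like `sudo JAVA_HOME="$JAVA_HOME" bin/crawl ...` or `sudo -E bin/crawl ...`, provided the sudoers policy allows it. Running the script without sudo avoids the issue altogether.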
The problem was that I was running
sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5
Instead, I used
bin/crawl ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5
and all is well. It was just a malformed command: the old 'crawl' command is deprecated, as stated in the Apache Nutch Tutorial.
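The Injector log in the question is consistent with this. bin/crawl reads its arguments by position, so in `bin/crawl crawl urls -dir crawl ...` the word `crawl` was taken as the seed location (`urlDir: crawl`) and `urls` as the crawl directory (`crawlDb: urls/crawldb`), and no path named `crawl` existed. Below is a minimal sketch of a pre-flight check that would catch both errors from the question before Nutch starts. The function name `preflight` is hypothetical and not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): verify the preconditions
# that failed in the question before handing over to bin/crawl.
preflight() {
  seed_path="$1"   # first positional argument of bin/crawl: the seed location

  # bin/crawl aborts with "Error: JAVA_HOME is not set." in this case.
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi

  # The Injector throws InvalidInputException when this path is missing.
  if [ ! -e "$seed_path" ]; then
    echo "seed path does not exist: $seed_path" >&2
    return 2
  fi

  return 0
}
```

It could be used as `preflight ./urls/seed.txt && bin/crawl ./urls/seed.txt TestCrawl http://localhost:8983/solr/ 5`, so the crawl only starts when both checks pass.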