Can't get apache nutch to crawl - permissions and JAVA_HOME suspected

Problem description

I'm trying to run a crawl by following the NutchTutorial:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

So I have Nutch all installed and set up with Solr. I set my $JAVA_HOME in my .bashrc to /usr/lib/jvm/java-1.6.0-openjdk-amd64.
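
For reference, the line in my .bashrc looks something like this (same path as above):

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64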

I don't see any problems when I run bin/nutch from the nutch home directory, but when I try to run the crawl as above I get the following error:

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /usr/share/nutch/logs/hadoop.log (Permission denied)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:207)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
        at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:270)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:281)
        at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2013-06-28 16:24:53
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:296)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I suspect it might have something to do with file permissions as I have to run sudo on almost everything on this server, but if I run the same crawl command with sudo I get:

Error: JAVA_HOME is not set.

So I feel like I've got a catch-22 situation going on here. Should I be able to run this command with sudo, or is there something else I need to do such that I don't have to run it with sudo and it will work, or is there something else entirely going on here?

Recommended answer

It seems that, as a normal user, you don't have permission to write to /usr/share/nutch/logs/hadoop.log, which makes sense as a security feature.
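
To confirm this, check who owns the log directory; assuming the default install path shown in the stack trace, the check looks something like:

# Show ownership and permissions of the Nutch log directory and the log file
ls -ld /usr/share/nutch/logs
ls -l /usr/share/nutch/logs/hadoop.log

If the directory is owned by root and not writable by your user, any attempt to append to hadoop.log fails exactly as in the trace above.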

To get around this, create a simple bash script:

#!/bin/sh
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Save this as nutch.sh, then run it with sudo:

sudo sh nutch.sh
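
This works because sudo, with its default env_reset behaviour, strips most of the caller's environment (including the JAVA_HOME you exported in .bashrc) before running the command, which is why the plain sudo run reported that JAVA_HOME is not set; exporting it inside the script sets it in the process sudo actually runs. Depending on your sudoers policy, you can also pass the variable on the sudo command line for a one-off run, roughly like this:

# One-off alternative: hand JAVA_HOME to the sudo'ed command directly (subject to sudoers policy)
sudo JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64 bin/nutch crawl urls -dir crawl -depth 3 -topN 5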
