Nutch 2 with Cassandra as storage is not crawling data properly


Problem description


I am using Nutch 2.x with Cassandra as storage. Currently I am crawling only one website, and the data is being loaded into Cassandra in byte code format. When I use the readdb command in Nutch, I do not get any useful crawl data.


Below are the details of different files and output I am getting:

=============== command to run the crawler ===============

bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3



=============== seed.txt data ===============

http://www.ft.com

=============== output of the readdb command reading data from the Cassandra webpage.f table ===============

~/Documents/Softwares/apache-nutch-2.3/runtime/local$ bin/nutch readdb -dump data -content
~/Documents/Softwares/apache-nutch-2.3/runtime/local/data$ cat part-r-00000 
http://www.ft.com/  key:    com.ft.www:http/
baseUrl:    null    
status: 4 (status_redir_temp)    
fetchTime:  1426888912463
prevFetchTime:  1424296904936
fetchInterval:  2592000
retriesSinceFetch:  0    
modifiedTime:   0    
prevModifiedTime:   0
protocolStatus: (null)    
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :   y
marker dist :   0    
reprUrl:    null    
batchId:    1424296906-20007    
metadata _csh_ : 
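To make the dump above easier to read, here is a minimal Python sketch that parses a few of its fields and converts the timestamps. It assumes that `fetchTime` and `prevFetchTime` are epoch milliseconds and that `fetchInterval` is in seconds. The `DUMP` string is copied from the output above, and the helper names `parse_fields` and `millis_to_utc` are made up for this illustration.

```python
from datetime import datetime, timezone

# Field lines copied from the readdb dump above ("name: value").
DUMP = """\
status: 4 (status_redir_temp)
fetchTime:  1426888912463
prevFetchTime:  1424296904936
fetchInterval:  2592000
protocolStatus: (null)
parseStatus:    (null)
title:  null
"""


def parse_fields(text):
    """Split each 'name: value' line into a dict entry."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def millis_to_utc(ms):
    """Convert epoch milliseconds (assumed unit) to a UTC datetime."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


fields = parse_fields(DUMP)
prev_fetch = millis_to_utc(fields["prevFetchTime"])
next_fetch = millis_to_utc(fields["fetchTime"])

print("status:               ", fields["status"])
print("previous fetch time:  ", prev_fetch.isoformat())
print("next fetch due:       ", next_fetch.isoformat())
print("gap in days:          ", (next_fetch - prev_fetch).days)
print("fetchInterval in days:", int(fields["fetchInterval"]) / 86400)
```

Read this way, the row looks like a page that was fetched once, recorded as a temporary redirect, and rescheduled one fetch interval later. That would fit the empty `title` and `parseStatus`, although the dump alone does not prove it.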

=============== content of regex-urlfilter.txt ===============

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.    
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else    
+.
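To rule the URL filter in or out as a cause, the rules above can be replayed outside Nutch. The Python sketch below assumes the filter tries the rules from top to bottom, lets the first matching rule decide, matches anywhere in the URL, and rejects a URL that no rule matches. Python's `re` stands in for Java's regex engine here, which should behave the same for these particular patterns. The two rejected URLs are hypothetical examples, not pages from the crawl.

```python
import re

# Rules copied from the regex-urlfilter.txt shown above.
RULES = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
"""


def load_rules(text):
    """Return (accept, compiled_pattern) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = load_rules(RULES)
for url in (
    "http://www.ft.com/",            # seed URL from seed.txt
    "http://www.ft.com/logo.png",    # hypothetical image URL
    "http://www.ft.com/search?q=x",  # hypothetical query URL
):
    print(url, "->", "accepted" if accepts(url, rules) else "rejected")
```

Under these assumptions the seed URL passes the filter, so the rules as posted do not look like the reason the page content is missing.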

=============== content of the log file that is bothering me ===============

2015-02-18 13:57:51,253 ERROR store.CassandraStore - 
2015-02-18 13:57:51,253 ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@653e3e90
2015-02-18 14:01:45,537 INFO  connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s


Please let me know if you need more information. Can someone please help me?


Thanks in advance. -Sumant

Recommended answer


I just started using Nutch and Cassandra today. I am not receiving the same errors in my log file during a crawl.


Did you double check your nutch-site.xml and gora.properties settings? This is how I currently have my files configured.

nutch-site.xml

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>My Spider</value>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.cassandra.store.CassandraStore</value>
        <description>Default class for storing data</description>
      </property>
    </configuration>
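One quick way to double-check this file is to parse it and print the properties it actually contains. This is a rough Python sketch using only the standard library. The `NUTCH_SITE` string is a copy of the configuration above, and `read_properties` is a helper name invented for this example. To check a real installation you would read your own `nutch-site.xml` instead (its location, for instance a `conf/` directory under `runtime/local`, is an assumption about your setup).

```python
import xml.etree.ElementTree as ET

# Copy of the nutch-site.xml shown above.
NUTCH_SITE = """\
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Spider</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # ignore malformed entries without a <name>
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


props = read_properties(NUTCH_SITE)
for name, value in props.items():
    print(f"{name} = {value}")

expected = "org.apache.gora.cassandra.store.CassandraStore"
if props.get("storage.data.store.class") != expected:
    print("storage.data.store.class is not set to the Cassandra store")
```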

gora.properties

#############################
# CassandraStore properties #
#############################
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160
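Since the errors in the question come from `store.CassandraStore`, it may also be worth confirming that something is listening on the address in `gora.cassandrastore.servers`. Here is a small Python sketch that reads that property and attempts a plain TCP connection. It only tests reachability, not whether Cassandra answers correctly. Treating the value as a comma-separated `host:port` list is an assumption, and the helper names are made up for this example.

```python
import socket

# Copy of the gora.properties lines shown above.
GORA_PROPERTIES = """\
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def split_servers(value):
    """Turn 'host:port,host:port' (assumed format) into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        host, _, port = entry.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    props = parse_properties(GORA_PROPERTIES)
    for host, port in split_servers(props["gora.cassandrastore.servers"]):
        print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

If the port is not reachable, the crawl cannot write to the store no matter how the two files above are set.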
