StormCrawler with SQL external module gets ParseFilters exception at crawl stage
Problem description
I use StormCrawler with the SQL external module. I have updated my pom.xml with:
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-sql</artifactId>
    <version>1.8</version>
</dependency>
I use a similar injector/crawl procedure to the one used with the ES setup:
storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000
I have created a MySQL database crawl with a table urls and successfully injected my URLs into it. For example, if I run select * from crawl.urls limit 5; I can see the URLs, status, and other fields. From this I conclude that, at this stage, the crawler connects to the database.
sql-injector.flux looks like this:
name: "injector"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "sql-conf.yaml"
    override: true

  - resource: false
    file: "my-config.yaml"
    override: true

components:
  - id: "scheme"
    className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
    constructorArgs:
      - DISCOVERED

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "seeds.txt"
      - ref: "scheme"

bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
When I run:
storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote sql-crawler.flux
I get the following exception at the parse bolt:
java.lang.RuntimeException: Exception caught while loading the ParseFilters from parsefilters.json
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:67)
    at com.digitalpebble.stormcrawler.bolt.JSoupParserBolt.prepare(JSoupParserBolt.java:116)
    at org.apache.storm.daemon.executor$fn__5043$fn__5056.invoke(executor.clj:803)
    at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:482)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to build JSON object from file
    at com.digitalpebble.stormcrawler.parse.ParseFilters.<init>(ParseFilters.java:92)
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:62)
    ... 5 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('}' (code 125)): was expecting double-quote to start field name...
sql-crawler.flux:
name: "crawler"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "sql-conf.yaml"
    override: true

  - resource: false
    file: "my-config.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.sql.SQLSpout"
    parallelism: 100

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
It looks like the object StringUtils at ParseFilters.java:60 is blank.
Recommended answer
Check the content of src/main/resources/parsefilters.json (or whichever file you have set for parsefilters.config.file). Judging by the error message, the JSON it contains is not valid. Don't forget to rebuild the uber jar with mvn clean package afterwards.
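One way to run that check is sketched below in Python. This is only an illustration, not part of StormCrawler: the helper names check_json and check_json_in_jar are made up here, the BROKEN sample is hypothetical content, and the script assumes parsefilters.json sits at the root of the uber jar, as resources from src/main/resources normally do. A trailing comma before a closing brace is a common cause of this kind of Jackson message, but Python's json module is a different parser from Jackson, so treat a clean result as a strong hint rather than proof.

```python
import json
import zipfile


def check_json(text):
    """Return None if text parses as JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


def check_json_in_jar(jar_path, entry="parsefilters.json"):
    """Validate a JSON resource packaged in a jar (a jar is a zip archive).

    The entry name is an assumption: resources from src/main/resources
    normally end up at the root of the jar.
    """
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read(entry).decode("utf-8")
    return check_json(text)


# Hypothetical file content with a trailing comma before the closing brace.
BROKEN = '{\n  "com.example.Filters": [],\n}'

if __name__ == "__main__":
    # Reports where the parser gave up on the broken sample.
    print(check_json(BROKEN))
```

Calling check_json_in_jar("target/stromcrawler-1.0-SNAPSHOT.jar") after mvn clean package checks the copy that is actually submitted to Storm, which catches the case where the source file was fixed but the jar was not rebuilt.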