StormCrawler with SQL external module gets ParseFilters exception at crawl stage


Question

I use StormCrawler with the SQL external module. I have updated my pom.xml with:

<dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-sql</artifactId>
        <version>1.8</version>
</dependency>

I use a similar injector/crawl procedure to the one for the ES setup:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000

I have created a MySQL database crawl with a table urls and successfully injected my URLs into it. For example, if I run select * from crawl.urls limit 5; I can see the URLs, their status, and the other fields. From this I conclude that, at this stage, the crawler connects to the database.

sql-injector.flux looks like this:

name: "injector"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

components:
- id: "scheme"
  className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
  constructorArgs:
    - DISCOVERED

spouts:
- id: "spout"
  className: "com.digitalpebble.stormcrawler.spout.FileSpout"
  parallelism: 1
  constructorArgs:
    - "seeds.txt"
    - ref: "scheme"

bolts:
- id: "status"
  className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
  parallelism: 1

streams:
- from: "spout"
  to: "status"
  grouping:
    type: CUSTOM
    customClass:
      className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
      constructorArgs:
        - "byHost"

When I run:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --remote sql-crawler.flux

I get the following exception at the parse bolt:

java.lang.RuntimeException: Exception caught while loading the ParseFilters from parsefilters.json
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:67)
    at com.digitalpebble.stormcrawler.bolt.JSoupParserBolt.prepare(JSoupParserBolt.java:116)
    at org.apache.storm.daemon.executor$fn__5043$fn__5056.invoke(executor.clj:803)
    at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:482)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to build JSON object from file
    at com.digitalpebble.stormcrawler.parse.ParseFilters.<init>(ParseFilters.java:92)
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:62)
    ... 5 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('}' (code 125)): was expecting double-quote to start field name...

[Screenshot of the Storm UI]

sql-crawler.flux:

name: "crawler"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

spouts:
- id: "spout"
  className: "com.digitalpebble.stormcrawler.sql.SQLSpout"
  parallelism: 100

bolts:
- id: "partitioner"
  className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
  parallelism: 1
- id: "fetcher"
  className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
  parallelism: 1
- id: "sitemap"
  className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
  parallelism: 1
- id: "parse"
  className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
  parallelism: 1
- id: "status"
  className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
  parallelism: 1

streams:
- from: "spout"
  to: "partitioner"
  grouping:
    type: SHUFFLE

- from: "partitioner"
  to: "fetcher"
  grouping:
    type: FIELDS
    args: ["key"]

- from: "fetcher"
  to: "sitemap"
  grouping:
    type: LOCAL_OR_SHUFFLE

- from: "sitemap"
  to: "parse"
  grouping:
    type: LOCAL_OR_SHUFFLE

- from: "fetcher"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

- from: "sitemap"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

- from: "parse"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

It looks like the object StringUtils at ParseFilters.java:60 is blank.

Answer

Check the content of src/main/resources/parsefilters.json (or whichever file you have set for parsefilters.config.file). Judging by the error message, the JSON it contains is not valid: Jackson found a '}' where it expected a quoted field name, which commonly happens when a trailing comma is left before a closing brace. Don't forget to rebuild the uber jar with mvn clean package afterwards.
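One quick way to confirm that the file is malformed, and that the rebuilt jar really contains the corrected copy, is to parse it outside the topology. The sketch below is only an illustration and uses Python's standard json and zipfile modules rather than Jackson. The helper names are hypothetical, and it assumes parsefilters.json sits at the root of the uber jar, which is where Maven normally places files from src/main/resources. Python's parser is not identical to Jackson (for example, it always rejects comments), so treat a clean result as a strong hint rather than proof.

```python
import json
import sys
import zipfile
from pathlib import Path


def check_json_text(text):
    """Return None if text is valid JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


def check_json_file(path):
    """Validate a JSON file on disk, e.g. src/main/resources/parsefilters.json."""
    return check_json_text(Path(path).read_text(encoding="utf-8"))


def check_json_in_jar(jar_path, entry="parsefilters.json"):
    """Validate a JSON entry packaged inside a jar (a jar is a zip archive).

    The default entry name assumes the file is at the root of the jar.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return check_json_text(jar.read(entry).decode("utf-8"))


if __name__ == "__main__":
    # Each argument is either a .json file or a .jar containing parsefilters.json.
    for arg in sys.argv[1:]:
        if arg.endswith(".jar"):
            problem = check_json_in_jar(arg)
        else:
            problem = check_json_file(arg)
        print(f"{arg}: {'OK' if problem is None else problem}")
```

Saved under a name of your choice (validate_json.py here is hypothetical), it can be run as python validate_json.py src/main/resources/parsefilters.json target/stromcrawler-1.0-SNAPSHOT.jar. If the source file passes but the jar entry fails, the jar was built before the fix and needs another mvn clean package.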
