Apache Nutch REST API

Question

I'm trying to launch a crawl via the REST API. A crawl starts with injecting URLs. Using the Chrome developer tool "Advanced Rest Client", I'm trying to build the POST payload below, but the response I get is a 400 Bad Request.

POST - http://localhost:8081/job/create

Payload

{
  "crawl-id":"crawl-01",
  "type":"INJECT",
  "config-id":"default",
  "args":{ "path/to/seedlist/directory"}
}

My problem is with the args. I think more is needed, but I'm not sure what. The NutchRESTAPI page gives these samples for creating a job:

POST /job/create
   {
      "crawlId":"crawl-01",
      "type":"FETCH",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

POST /job/create
   {
      "crawlId":"crawl-01",
      "jobClassName":"org.apache.nutch.fetcher.FetcherJob",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

I'm not sure what parameter or value to give each of the commands (e.g. Inject, Generate, Fetch, Parse, and UpdateDb) to complete a job. Can someone clear this up? How do I tell the API where to look for the seed list?

Update

When trying to complete the Generate command, I ran into a classException error: the value for the topN key must be of type long, but the API reads it as either a string or an int. I found a fix that is supposed to be included in the 2.3.1 release (release date: TBA), applied it, and recompiled the code. It now works.

Recommended answer

At the time of this posting, the REST API is not yet complete. A much more detailed document exists, though it's still not comprehensive. It is linked from the following email on the user mailing list (which you might want to consider joining):

http://www.mail-archive.com/user%40nutch.apache.org/msg13652.html

But to answer your question about the seed list: you can create the seed list through REST, or you can pass the argument "seedDir":

{
    "args":{
        "seedDir":"/path/to/seed/directory"
    },
    "confId":"default",
    "crawlId":"sample-crawl-01",
    "type":"INJECT"
}
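As a rough illustration, the payload above could be sent from a script instead of a browser REST client. The sketch below uses only the Python standard library; the server address is the one from the question, and the helper names (build_inject_job, build_request, submit) are hypothetical, not part of Nutch. It assumes the server accepts a JSON body with a Content-Type of application/json.

```python
import json
import urllib.request

# Assumption: the Nutch REST server address used in the question.
NUTCH_URL = "http://localhost:8081"


def build_inject_job(seed_dir, crawl_id, conf_id="default"):
    """Build the INJECT job payload shown in the answer (hypothetical helper)."""
    return {
        "type": "INJECT",
        "crawlId": crawl_id,
        "confId": conf_id,
        "args": {"seedDir": seed_dir},
    }


def build_request(payload, base_url=NUTCH_URL):
    """Wrap the payload in a POST request to /job/create without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + "/job/create",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request):
    """Send the request and return the raw response text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


request = build_request(
    build_inject_job("/path/to/seed/directory", "sample-crawl-01")
)
# With a Nutch REST server running, the job could then be submitted with:
# print(submit(request))
```

Serializing the payload with json.dumps guarantees the body is well-formed JSON, which rules out one possible cause of a 400 response; note that an "args" object needs key/value pairs such as "seedDir" and its path, not a bare string.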
