Apache Nutch REST API


Question

I'm trying to launch a crawl via the REST API. A crawl starts with injecting URLs. Using the Chrome developer tool "Advanced Rest Client", I'm trying to build up this POST payload, but the response I get is a 400 Bad Request.

POST - http://localhost:8081/job/create

Payload

{
  "crawl-id":"crawl-01",
  "type":"INJECT",
  "config-id":"default",
  "args":{ "path/to/seedlist/directory"}
}

My problem is with the args. I think more is needed, but I'm not sure what. The NutchRESTAPI page gives these samples for creating a job:

POST /job/create
   {
      "crawlId":"crawl-01",
      "type":"FETCH",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

POST /job/create
   {
      "crawlId":"crawl-01",
      "jobClassName":"org.apache.nutch.fetcher.FetcherJob",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

I'm not sure what parameters or values each command (e.g. Inject, Generate, Fetch, Parse, and UpdateDb) needs to complete a job. Can someone clear this up? How do I tell the API where to look for the seed list?

Update

While trying to run the Generate command I hit a classException error: the value for the topN key must be of type long, but the API reads it as either a string or an int. I found a fix that is supposed to be included in the 2.3.1 release (release date: TBA), applied it, and recompiled my code. It now works.

Answer

At the time of this posting, the REST API is not yet complete. A much more detailed document exists, though it is still not comprehensive. It is linked from the following email on the user mailing list (which you might want to consider joining):

http://www.mail-archive.com/user%40nutch.apache.org/msg13652.html

But to answer your question about the seed list: you can create the seed list through REST, or you can use the argument "seedDir":

{
    "args":{
        "seedDir":"/path/to/seed/directory"
    },
    "confId":"default",
    "crawlId":"sample-crawl-01",
    "type":"INJECT"
}
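
As a rough illustration, the payload above could be built and posted from Python along these lines. This is only a sketch: the endpoint is the one from the question, the field names are the ones shown in the answer, and the helper names (`build_inject_payload`, `build_request`, `submit`, `is_valid_json`) are made up for this example. The `is_valid_json` helper is included because the `args` object in the question holds a bare string with no key, which does not parse as JSON and may be one reason for the 400 response.

```python
import json
import urllib.request

# Endpoint taken from the question; adjust host and port for your server.
JOB_CREATE_URL = "http://localhost:8081/job/create"


def build_inject_payload(seed_dir, crawl_id="sample-crawl-01", conf_id="default"):
    """Return the INJECT job body from the answer as a dict (hypothetical helper)."""
    return {
        "args": {"seedDir": seed_dir},
        "confId": conf_id,
        "crawlId": crawl_id,
        "type": "INJECT",
    }


def build_request(payload, url=JOB_CREATE_URL):
    """Wrap the payload in a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request):
    """Send the request; this needs a running Nutch server to succeed."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


def is_valid_json(text):
    """Report whether the given text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Calling `submit(build_request(build_inject_payload("/path/to/seed/directory")))` would send the job. Checking a hand-written body with `is_valid_json` before posting can catch a malformed `args` object early.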
