Apache Nutch REST API
Question
I'm trying to launch a crawl via the REST API. A crawl starts with injecting URLs. Using the Chrome developer tool "Advanced REST Client", I'm trying to build up this POST payload, but the response I get is a 400 Bad Request.
POST - http://localhost:8081/job/create
Payload
{
"crawl-id":"crawl-01",
"type":"INJECT",
"config-id":"default",
"args":{ "path/to/seedlist/directory"}
}
My problem is in the args: I think more is needed, but I'm not sure. The NutchRESTAPI page gives these samples for creating a job.
POST /job/create
{
"crawlId":"crawl-01",
"type":"FETCH",
"confId":"default",
"args":{"someParam":"someValue"}
}
POST /job/create
{
"crawlId":"crawl-01",
"jobClassName":"org.apache.nutch.fetcher.FetcherJob",
"confId":"default",
"args":{"someParam":"someValue"}
}
I'm not sure what params or values to give each of the commands (e.g. Inject, Generate, Fetch, Parse, and UpdateDb) to complete a job. Can someone clear this up? How do I tell the API where to look for the seedlist?
Update
When trying to complete the Generate command, I ran into a classException error: the value for the topN key should be of type long, but the API reads it as either a string or an int. I found a fix that is supposed to be included in the 2.3.1 release (release date: TBA), applied it, and recompiled my code. It now works.
Recommended answer
At the time of this posting, the REST API is not yet complete. A much more detailed document exists, though it's still not comprehensive. It is linked to in the following email from the user mailing list (which you might want to consider joining):
http://www.mail-archive.com/user%40nutch.apache.org/msg13652.html
But to answer your question about the seedlist: you can create the seedlist through REST, or you can use the "seedDir" argument:
{
"args":{
"seedDir":"/path/to/seed/directory"
},
"confId":"default",
"crawlId":"sample-crawl-01",
"type":"INJECT"
}
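For anyone scripting this instead of using a REST client, the payload above could be assembled and posted roughly as follows. This is only a minimal sketch: the host and port are taken from the question, the helper names (`build_inject_payload`, `build_job_request`, `submit_job`) are made up for illustration, and the `Content-Type` header is an assumption about what the server expects rather than something confirmed by the Nutch documentation.

```python
import json
import urllib.request

# Assumption: the Nutch REST server listens here, as in the question.
NUTCH_URL = "http://localhost:8081"


def build_inject_payload(seed_dir, crawl_id="sample-crawl-01", conf_id="default"):
    """Return the INJECT job description from the answer as a plain dict."""
    return {
        "args": {"seedDir": seed_dir},  # args is a key/value object
        "confId": conf_id,
        "crawlId": crawl_id,
        "type": "INJECT",
    }


def build_job_request(payload, base_url=NUTCH_URL):
    """Wrap the payload in a POST request to /job/create (nothing is sent yet)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/job/create",
        data=body,
        # Assumption: the server wants the body declared as JSON.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(request):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    req = build_job_request(build_inject_payload("/path/to/seed/directory"))
    # Show what would be sent; no network access happens here.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Call submit_job(req) once a Nutch REST server is actually running.
```

Building the body with `json.dumps` guarantees it is well-formed JSON. The payload in the question is not: `{ "path/to/seedlist/directory"}` has a value but no key, and a JSON object needs key/value pairs. The question also uses the keys "crawl-id" and "config-id" where the samples use "crawlId" and "confId".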