Adding URL parameter to Nutch/Solr index and search results


Question

I can't find any hint on how to set up Nutch so that it does NOT filter or remove my URL parameters. I want to crawl and index some pages where lots of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3 and so on).

  • The regex-normalize.xml only removes redundant stuff from the URL (like session IDs and a trailing ?)
  • The regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/)

The crawling works fine so far. Any ideas?

Cheers, Mana

A part of the solution is hidden here:

Configuring nutch regex-normalize.xml

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

This rule, which ships in the default regex-urlfilter.txt, has to be modified. One has to allow all characters that may appear in a URL parameter, such as '?' and '='. The new line looks like

-[*!@]

Pages are now crawled with their parameters. But they are not yet sent to Solr with parameters (Solr still cuts the parameters from the links).
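
The effect of that filter change can be illustrated outside Nutch. The following is only a rough Python sketch of how a list of +/- regex rules is commonly evaluated (the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected). It is not Nutch's actual code; www.example.com is a made-up stand-in for $myHost, and parse_rules and accepts are hypothetical helper names.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # '+' means accept; anything else is treated as reject in this sketch.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose pattern occurs anywhere in the URL decides."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is rejected


# Original rule set: '?' and '=' cause the URL to be skipped.
DEFAULT_RULES = parse_rules(r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+^http://www\.example\.com/
""")

# Modified rule set: '?' and '=' are no longer in the reject class.
MODIFIED_RULES = parse_rules(r"""
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
+^http://www\.example\.com/
""")

URL = "http://www.example.com/news.jsp?id=1"  # hypothetical host

if __name__ == "__main__":
    print(accepts(DEFAULT_RULES, URL))   # False: rejected because of '?' and '='
    print(accepts(MODIFIED_RULES, URL))  # True: falls through to the host rule
```

With the original character class the parameterized URL never reaches the host rule, which is why the pages were skipped before the change.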

EDIT 2:

Nutch has some issues with how it handles relative URLs ('?param=value'). Still stuck on that parameter thing:

See the mailing list: http://search.lucidimagination.com/search/document/b6011a942b323ba3/problem_with_href_param_value_links
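
For reference, this is what a query-only relative link such as '?id=2' is expected to resolve to under standard (RFC 3986) reference resolution. The sketch uses Python's urllib.parse.urljoin purely as an illustration of the expected result; it says nothing about what Nutch itself does, and the host name and the resolve helper are hypothetical.

```python
from urllib.parse import urljoin


def resolve(base_url, href):
    """Resolve a link found on a page against the URL of that page."""
    return urljoin(base_url, href)


BASE = "http://www.example.com/news.jsp?id=1"  # hypothetical page URL

if __name__ == "__main__":
    # A query-only link keeps the base path and replaces only the query string.
    print(resolve(BASE, "?id=2"))
```

If the URLs in the crawl database do not match this expected form, the relative link handling is a likely place to look.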

Recommended answer

You could create a custom field in a Nutch indexing filter to save the entire URL. As long as you define the same field in the Solr schema with stored="true", it will show up in your results. See WritingPluginExample-1.2.

Let me know if you'd like some help.
