Configuring Nutch regex-normalize.xml


Question

I am using the Java-based Nutch web-search software. To prevent duplicate URLs from being returned in my search results, I am trying to remove (i.e. normalize away) the 'jsessionid' expressions from the URLs being indexed when the Nutch crawler indexes my intranet. However, my modifications to $NUTCH_HOME/conf/regex-normalize.xml (made before running the crawl) do not seem to have any effect.

  1. How can I ensure that my regex-normalize.xml configuration is actually being used for my crawl?

  2. What regular expression will successfully remove (normalize) 'jsessionid' expressions from the URL during the crawl/indexing?

These are the contents of my current regex-normalize.xml:

<?xml version="1.0"?>
<regex-normalize>
<regex>
 <pattern>(.*);jsessionid=(.*)$</pattern>
 <substitution>$1</substitution>
</regex>
<regex>
 <pattern>(.*);jsessionid=(.*)(\&amp;|\&amp;amp;)</pattern>
 <substitution>$1$3</substitution>
</regex>
<regex>
 <pattern>;jsessionid=(.*)</pattern>
 <substitution></substitution>
</regex>
</regex-normalize>

Here is the command that I am issuing to run my (test) crawl:

bin/nutch crawl urls -dir /tmp/test/crawl_test -depth 3 -topN 500

Recommended answer

Which version of Nutch are you using? I'm not familiar with Nutch, but the default download of Nutch 1.0 already contains a rule in regex-normalize.xml that seems to handle this problem:

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>
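
To see what this rule does to a URL, it can be tried outside Nutch with plain java.util.regex. The following is only a minimal sketch: the class name, method name, and sample URLs are made up for illustration, and it assumes the normalizer applies each rule roughly like Matcher.replaceAll with the given substitution.

```java
import java.util.regex.Pattern;

public class SessionIdRuleDemo {
    // The rule's pattern as it reads after XML unescaping (&amp; becomes &).
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the rule is applied like replaceAll with the substitution "$4",
    // which keeps only the delimiter that follows the session id.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Hypothetical intranet URLs, with and without a query string.
        System.out.println(normalize("http://intranet.example/app/page.jsp;jsessionid=ABC123?x=1"));
        System.out.println(normalize("http://intranet.example/app/page.jsp;jsessionid=ABC123"));
    }
}
```

Because the substitution is the trailing delimiter rather than everything before the session id, a query string after the session id survives. By contrast, the first pattern in the question, (.*);jsessionid=(.*)$ with substitution $1, would drop the query string along with the session id.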

By the way, regex-urlfilter.txt also seems to contain something relevant:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
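
Why this line matters for jsessionid URLs can be checked with a few lines of Java. This is a rough sketch, not Nutch's own filter code: the class and method names are hypothetical, and it assumes that a line starting with '-' excludes any URL in which the pattern is found.

```java
import java.util.regex.Pattern;

public class UrlFilterRuleDemo {
    // Character class from the "-[?*!@=]" line; the leading '-' marks an exclude rule.
    static final Pattern PROBABLE_QUERY = Pattern.compile("[?*!@=]");

    // Assumption: an exclude rule rejects a URL when its pattern occurs anywhere in it.
    static boolean isSkipped(String url) {
        return PROBABLE_QUERY.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical URLs: the first contains '=', the second does not.
        System.out.println(isSkipped("http://intranet.example/app/page.jsp;jsessionid=ABC123"));
        System.out.println(isSkipped("http://intranet.example/app/page.jsp"));
    }
}
```

Under that assumption, a URL that still carries ;jsessionid=... when the filter sees it would be skipped because of the '=' character, so it is worth checking whether this rule is active in your configuration.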

Then there are some settings in nutch-default.xml that you might want to check:

urlnormalizer.order
urlnormalizer.regex.file
plugin.includes

If none of that helps, maybe this does: How can I force fetcher to use custom nutch-config?
