Configuring Nutch regex-normalize.xml
Question
I am using the Java-based Nutch web-search software. To prevent duplicate URLs from being returned in my search results, I am trying to remove (a.k.a. normalize) 'jsessionid' expressions from the URLs being indexed when the Nutch crawler indexes my intranet. However, my modifications to $NUTCH_HOME/conf/regex-normalize.xml (made before running my crawl) do not seem to have any effect.
How can I ensure that my regex-normalize.xml configuration is actually applied during my crawl?
What regular expression will successfully remove/normalize 'jsessionid' expressions from the URL during crawling/indexing?
The following are the contents of my current regex-normalize.xml:
<?xml version="1.0"?>
<regex-normalize>
<regex>
<pattern>(.*);jsessionid=(.*)$</pattern>
<substitution>$1</substitution>
</regex>
<regex>
<pattern>(.*);jsessionid=(.*)(\&amp;|\&amp;amp;)</pattern>
<substitution>$1$3</substitution>
</regex>
<regex>
<pattern>;jsessionid=(.*)</pattern>
<substitution></substitution>
</regex>
</regex-normalize>
Here is the command I am issuing to run my (test) crawl:
bin/nutch crawl urls -dir /tmp/test/crawl_test -depth 3 -topN 500
Recommended answer
What version of Nutch are you using? I'm not very familiar with Nutch, but the default download of Nutch 1.0 already contains a rule in regex-normalize.xml that seems to handle this problem:
<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
<substitution>$4</substitution>
</regex>
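To see what this rule does to a URL, here is a minimal standalone sketch in plain Java. It assumes the normalizer applies each rule with java.util.regex and replaceAll, which I have not verified against the Nutch source; the class name and the example URLs are made up for illustration.

```java
import java.util.regex.Pattern;

public class SessionIdRuleDemo {
    // The pattern as the regex engine sees it once the XML has been parsed.
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the rule is applied with replaceAll and the substitution "$4",
    // so the session id is dropped and the delimiter that followed it is kept.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // prints http://intranet.example/app/page.jsp?x=1
        System.out.println(normalize("http://intranet.example/app/page.jsp;jsessionid=ABC123?x=1"));
        // prints http://intranet.example/app/page.jsp
        System.out.println(normalize("http://intranet.example/app/page.jsp;jsessionid=ABC123"));
    }
}
```

Because the substitution keeps group 4, a query string that follows the session id survives the rewrite.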
By the way, regex-urlfilter.txt also seems to contain something relevant:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
There are also some settings in nutch-default.xml that you might want to check:
urlnormalizer.order
urlnormalizer.regex.file
plugin.includes
If none of that helps, maybe this does: How can I force fetcher to use custom nutch-config?
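Separately from whether Nutch picks the file up, the rules themselves can be sanity-checked outside a crawl. The sketch below is not Nutch code: it is a hypothetical helper (RuleFileCheck) that parses a regex-normalize.xml-style document with the JDK's XML parser and applies each rule in file order via replaceAll, on the assumption that this is roughly what the normalizer does. It also fails fast if the file is not well-formed XML, for example because of an unescaped & in a pattern.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RuleFileCheck {
    // Hypothetical helper: applies every <regex> rule in rulesXml to url, in file order.
    static String normalize(String rulesXml, String url) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rulesXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            // Reached when the rules text is not well-formed XML.
            throw new IllegalArgumentException("rules are not well-formed XML", e);
        }
        NodeList rules = doc.getElementsByTagName("regex");
        String result = url;
        for (int i = 0; i < rules.getLength(); i++) {
            Element rule = (Element) rules.item(i);
            String pattern = rule.getElementsByTagName("pattern").item(0).getTextContent();
            String substitution = rule.getElementsByTagName("substitution").item(0).getTextContent();
            result = Pattern.compile(pattern).matcher(result).replaceAll(substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        String rulesXml = "<?xml version=\"1.0\"?>"
            + "<regex-normalize><regex>"
            + "<pattern>;jsessionid=(.*)</pattern>"
            + "<substitution></substitution>"
            + "</regex></regex-normalize>";
        // prints http://intranet.example/p.jsp (the query string is removed as well)
        System.out.println(normalize(rulesXml, "http://intranet.example/p.jsp;jsessionid=ABC?x=1"));
    }
}
```

Note that a greedy rule such as ;jsessionid=(.*) strips everything after the session id, including any query string, which may or may not be what you want.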