What does the dollar sign mean in robots.txt
Question
I am curious about a website and want to do some web crawling at the /s path. Its robots.txt:
User-Agent: *
Allow: /$
Allow: /debug/
Allow: /qa/
Allow: /wiki/
Allow: /cgi-bin/loginpage
Disallow: /
My questions:

What does the dollar sign mean in this case?

And is it appropriate to crawl the URL /s with respect to the robots.txt file?
Answer
If you follow the original robots.txt specification, $ has no special meaning, and there is no Allow field defined. A conforming bot has to ignore fields it does not know, so such a bot would effectively see this record:
User-Agent: *
Disallow: /
However, the original robots.txt specification has been extended by various parties. But since the authors of the robots.txt in question did not target a specific bot, we don't know which "extension" they had in mind.
Typically (but not necessarily, as it's not formally specified), Allow overrides rules specified in Disallow, and $ matches the end of the URL path.
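To make this interpretation concrete, here is a minimal sketch of a Google-style matcher. The function names (matches, is_allowed) and the longest-match tiebreak structure are my own illustration, not code from any robots.txt library; it assumes Google's documented semantics: * matches any character sequence, a trailing $ anchors the end of the path, the longest matching pattern wins, and Allow beats Disallow on a tie.

```python
import re

# Rules from the robots.txt in question, as (kind, pattern) pairs.
RULES = [
    ("allow", "/$"),
    ("allow", "/debug/"),
    ("allow", "/qa/"),
    ("allow", "/wiki/"),
    ("allow", "/cgi-bin/loginpage"),
    ("disallow", "/"),
]

def matches(pattern, path):
    """Match a robots.txt pattern against a URL path, Google-style:
    '*' matches any character sequence, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if c == "*" else re.escape(c) for c in pattern)
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None  # plain prefix match otherwise

def is_allowed(rules, path):
    """The most specific (longest) matching pattern wins;
    on a tie, Allow beats Disallow (Google's tiebreak)."""
    winner, winner_len = "allow", -1  # no matching rule -> allowed
    for kind, pattern in rules:
        if matches(pattern, path):
            if len(pattern) > winner_len or (
                len(pattern) == winner_len and kind == "allow"
            ):
                winner, winner_len = kind, len(pattern)
    return winner == "allow"

print(is_allowed(RULES, "/"))   # -> True: "Allow: /$" (length 2) beats "Disallow: /"
print(is_allowed(RULES, "/s"))  # -> False: only "Disallow: /" matches
```

Note how the anchored pattern /$ matches only the bare root path, so it carves out exactly one URL from the blanket Disallow.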
Following this interpretation (used, for example, by Google), Allow: /$ would mean: you may crawl /, but you may not crawl /a, /b, and so on.
So crawling URLs whose path starts with /s would not be allowed, neither according to the original spec (thanks to Disallow: /) nor according to Google's extension.
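You can sanity-check this conclusion with Python's standard-library urllib.robotparser. It understands Allow lines but gives $ no special meaning (the character is treated literally and rules are applied by prefix match in file order), so it represents the stricter, original-spec reading, and it still reports /s as disallowed:

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-Agent: *
Allow: /$
Allow: /debug/
Allow: /qa/
Allow: /wiki/
Allow: /cgi-bin/loginpage
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The stdlib parser matches path prefixes in file order (first match wins)
# and treats '$' as a literal character, not an end-of-path anchor.
print(rp.can_fetch("mybot", "/s"))       # -> False: caught by "Disallow: /"
print(rp.can_fetch("mybot", "/wiki/x"))  # -> True: "Allow: /wiki/" matches first
```

Either way you read the file, the practical answer to the question is the same: the site does not permit crawling under /s.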