How to disallow all dynamic URLs in robots.txt
Problem description
How do I disallow all dynamic URLs in robots.txt?
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
I want to disallow everything that starts with /?q=.
Recommended answer
The answer to your question is to use
Disallow: /?q=
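For context, a minimal robots.txt built around this rule could look like the sketch below (the User-agent: * line is my assumption; narrow it to a specific crawler name if you only want to restrict certain bots):

User-agent: *
Disallow: /?q=

Because Disallow: values are matched as prefixes, this single line already covers /?q=admin/, /?q=search/, /?q=user/login/ and every other URL beginning with /?q=, so the ten separate rules listed in the question become redundant.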
The best (currently accessible) source on robots.txt I could find is on Wikipedia. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)
According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; each "disallowed" path is actually a path prefix, i.e. it matches any path that starts with the specified value.
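If you want to sanity-check this prefix behaviour, Python's standard-library urllib.robotparser implements the same matching; here is a minimal sketch (the example.com URLs are placeholders):

from urllib import robotparser

# Parse an in-memory robots.txt instead of fetching one over HTTP.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /?q=",
])

# URLs whose path-plus-query starts with /?q= are blocked ...
print(rp.can_fetch("*", "http://example.com/?q=admin/"))       # False
print(rp.can_fetch("*", "http://example.com/?q=user/login/"))  # False
# ... while other URLs stay crawlable.
print(rp.can_fetch("*", "http://example.com/node/1"))          # True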
The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow: would also be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.
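(As an aside, some major crawlers, Googlebot for example, document their own support for * and $ wildcards, so you may see the same intent written as the line below. This is a crawler-specific extension, not something the base standard guarantees, and it also matches ?q= appearing later in a URL, which is broader than the plain prefix rule above.)

Disallow: /*?q=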
This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a Disallow: line would break on (hypothetical) robots.txt files where those characters were intended to be interpreted literally.