Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?


Problem Description


Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration purposes and will be different in a real robots.txt file.

I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions and I do not know which is the correct / proper method.

Questions:

(1) Can each user agent have its own crawl-delay? (I assume yes)

(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

(3) Does there have to be a blank line between each user agent group?

References:

http://www.seopt.com/2013/01/robots-text-file/

http://help.yandex.com/webmaster/?id=1113851#1113858

Essentially, I am looking to find out how the final robots.txt file should look using the values in the sample below.

Thanks in advance.

# Allow only major search spiders    
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13

User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14

User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15

User-agent: MSNBot
Disallow:
Crawl-delay: 16

User-agent: bingbot
Disallow:
Crawl-delay: 17

User-agent: Slurp
Disallow:
Crawl-delay: 18

User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

# Allow only major search spiders
User-agent: *
Crawl-delay: 10

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot-Mobile
Disallow:

User-agent: MSNBot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Yahoo! Slurp
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

Solution

(1) Can each user agent have its own crawl-delay?

Yes. Each record, started by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification, but it's no problem to include it for parsers that understand it, as the spec defines:

Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your Crawl-delay lines.


(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

Doesn’t matter.
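For example, a parser that supports Crawl-delay reads these two forms of the same record identically (a minimal sketch; the value 12 is just the illustrative one from your example):

User-agent: Googlebot
Crawl-delay: 12
Disallow:

and

User-agent: Googlebot
Disallow:
Crawl-delay: 12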


(3) Does there have to be a blank line between each user agent group?

Yes. Records have to be separated by one or more blank lines. See the original spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).


(4) If I want to set all of the user agents to have crawl delay of 10 seconds, would the following be correct?

No. Bots look for records that match their user-agent. Only if they don't find a matching record will they use the User-agent: * record. So in your example, all the listed bots (like Googlebot, MSNBot, Yahoo! Slurp etc.) will have no Crawl-delay.
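So if every listed bot should get the 10-second delay, a sketch of the fix is to repeat the Crawl-delay line inside each named record, rather than putting it only in the User-agent: * record, e.g.:

User-agent: Googlebot
Disallow:
Crawl-delay: 10

User-agent: bingbot
Disallow:
Crawl-delay: 10

# ... and likewise for Mediapartners-Google, Adsbot-Google, Googlebot-Image,
# Googlebot-Mobile, MSNBot, Slurp and Yahoo! Slurp

The User-agent: * record then only matters for bots that are not listed by name.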


Also note that you can’t have several records with User-agent: *:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

So parsers might (if no other record matches) use the first record with User-agent: * and ignore the following ones. For your first example, that would mean the Disallow lines for /ads/, /cgi-bin/ and /scripts/ in the second User-agent: * record are never applied.

And even if you have only one record with User-agent: *, those Disallow lines only apply to bots that have no other matching record! As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you'd have to repeat the Disallow lines for every record.
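Putting all of this together, a sketch of how the final robots.txt from your first example might look (keeping your illustrative delay values; only the first two named records are written out, the rest follow the same pattern):

# Allow only major search spiders, each with its own crawl delay;
# the directory blocks are repeated in every record
User-agent: Mediapartners-Google
Crawl-delay: 11
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: Googlebot
Crawl-delay: 12
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

# ... repeat for Adsbot-Google, Googlebot-Image, Googlebot-Mobile,
# MSNBot, bingbot, Slurp and Yahoo! Slurp, each with its own Crawl-delay

# Block all other spiders (a single User-agent: * record)
User-agent: *
Disallow: /

The separate directory-blocking record for User-agent: * is no longer needed, since Disallow: / already blocks everything (including /ads/, /cgi-bin/ and /scripts/) for any bot that falls through to this record.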
