Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?


Problem Description


Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration purposes and will be different in a real robots.txt file.

I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions and I do not know which is the correct / proper method.

Questions:

(1) Can each user agent have its own crawl-delay? (I assume yes)

(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

(3) Does there have to be a blank line between each user agent group?

References:

http://www.seopt.com/2013/01/robots-text-file/

http://help.yandex.com/webmaster/?id=1113851#1113858

Essentially, I am looking to find out how the final robots.txt file should look using the values in the sample below.

Thanks in advance.

# Allow only major search spiders    
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13

User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14

User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15

User-agent: MSNBot
Disallow:
Crawl-delay: 16

User-agent: bingbot
Disallow:
Crawl-delay: 17

User-agent: Slurp
Disallow:
Crawl-delay: 18

User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

# Allow only major search spiders
User-agent: *
Crawl-delay: 10

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot-Mobile
Disallow:

User-agent: MSNBot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Yahoo! Slurp
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

Solution

(1) Can each user agent have its own crawl-delay?

Yes. Each record, started by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification, but it's no problem to include it for parsers that understand it, as the spec defines:

Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your Crawl-delay lines.


(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

Doesn’t matter.
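For example, a parser that supports Crawl-delay reads these two forms of the same record identically (a minimal sketch; the value 12 is just the illustrative one from your example):

User-agent: Googlebot
Crawl-delay: 12
Disallow:

and

User-agent: Googlebot
Disallow:
Crawl-delay: 12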


(3) Does there have to be a blank line between each user agent group?

Yes. Records have to be separated by one or more blank lines. See the original spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).


(4) If I want to set all of the user agents to have crawl delay of 10 seconds, would the following be correct?

No. Bots look for records that match their user-agent. Only if they don't find a matching record will they use the User-agent: * record. So in your example, all the listed bots (like Googlebot, MSNBot, Yahoo! Slurp etc.) will have no Crawl-delay.
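So if every listed bot should get the 10-second delay, a sketch of the fix is to repeat the Crawl-delay line inside each named record, rather than putting it only in the User-agent: * record, e.g.:

User-agent: Googlebot
Disallow:
Crawl-delay: 10

User-agent: bingbot
Disallow:
Crawl-delay: 10

# ... and likewise for Mediapartners-Google, Adsbot-Google, Googlebot-Image,
# Googlebot-Mobile, MSNBot, Slurp and Yahoo! Slurp

The User-agent: * record then only matters for bots that are not listed by name.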


Also note that you can’t have several records with User-agent: *:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

So parsers might (if no other record matches) use the first record with User-agent: * and ignore the following ones. For your first example, that would mean the Disallow lines for /ads/, /cgi-bin/ and /scripts/ in the second User-agent: * record are never applied.

And even if you have only one record with User-agent: *, those Disallow lines only apply to bots that have no other matching record! As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you'd have to repeat the Disallow lines for every record.
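Putting all of this together, a sketch of how the final robots.txt from your first example might look (keeping your illustrative delay values; only the first two named records are written out, the rest follow the same pattern):

# Allow only major search spiders, each with its own crawl delay;
# the directory blocks are repeated in every record
User-agent: Mediapartners-Google
Crawl-delay: 11
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: Googlebot
Crawl-delay: 12
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

# ... repeat for Adsbot-Google, Googlebot-Image, Googlebot-Mobile,
# MSNBot, bingbot, Slurp and Yahoo! Slurp, each with its own Crawl-delay

# Block all other spiders (a single User-agent: * record)
User-agent: *
Disallow: /

The separate directory-blocking record for User-agent: * is no longer needed, since Disallow: / already blocks everything (including /ads/, /cgi-bin/ and /scripts/) for any bot that falls through to this record.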
