Is the User-Agent line in robots.txt an exact match or a substring match?

Views: 60  Published: 2021/7/10 19:17:44  Tags: web-crawler user-agent robots.txt

Problem description

When a crawler reads the User-Agent line of a robots.txt file, does it try to match the line's value exactly against its own User-Agent, or does it try to match it as a substring of its User-Agent?

Nothing I have read answers this question explicitly. According to another StackOverflow thread, it is an exact match.

However, the RFC draft leads me to believe that it is a substring match. For example, User-Agent: Google would match "Googlebot" and "Googlebot-News". Here is the relevant quotation from the RFC:

The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
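
To make the difference between the candidate rules concrete, here is a minimal Python sketch. It is an illustration only, not code from the RFC draft or from any crawler. The function names are hypothetical, and the case-insensitive comparison is an assumption. Note that the quoted sentence, read literally, says the line's value contains the robot's name token, which is the opposite direction from the User-Agent: Google example above, so both directions are shown.

```python
def exact_match(line_value: str, name_token: str) -> bool:
    """Exact rule: the line's value equals the robot's name token."""
    return line_value.lower() == name_token.lower()


def line_contains_token(line_value: str, name_token: str) -> bool:
    """Literal reading of the quote: the User-Agent line's value
    contains the robot's name token as a substring."""
    return name_token.lower() in line_value.lower()


def token_contains_line(line_value: str, name_token: str) -> bool:
    """Reverse reading: the line's value is a substring of the
    robot's name token (the User-Agent: Google example)."""
    return line_value.lower() in name_token.lower()


if __name__ == "__main__":
    # Compare the three rules for a robot whose name token is "Googlebot".
    for line_value in ("Google", "Googlebot", "Googlebot-News"):
        print(
            line_value,
            exact_match(line_value, "Googlebot"),
            line_contains_token(line_value, "Googlebot"),
            token_contains_line(line_value, "Googlebot"),
        )
```

Under these assumptions the three rules disagree on "Google" and on "Googlebot-News", so which one a crawler implements decides which record it obeys.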

Additionally, the "Order of precedence for user-agents" section of Googlebot's documentation explains that the Google Images user agent "Googlebot-Image/1.0" matches User-Agent: googlebot.

I would appreciate any clarification here, and the answer may be more complicated than my question. For example, Eugene Kalinin's robots module for node mentions splitting the User-Agent to get the "name token" on line 29 and matching against that. If this is true, then Googlebot's User-Agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" will not match User-Agent: Googlebot.
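
The concern about the full Googlebot header can be sketched as follows. This is a hypothetical illustration in Python, not the node module's actual code. The helper `first_product_token` assumes the "name token" is whatever precedes the first "/" in the header, which may differ from how the module really splits it.

```python
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def first_product_token(user_agent_header: str) -> str:
    """Hypothetical extraction: the text before the first '/'."""
    return user_agent_header.split("/", 1)[0].strip()


def matches_by_token(line_value: str, user_agent_header: str) -> bool:
    """Compare the robots.txt value with the extracted token only."""
    token = first_product_token(user_agent_header)
    return line_value.lower() == token.lower()


def matches_by_substring(line_value: str, user_agent_header: str) -> bool:
    """Look for the robots.txt value anywhere in the full header."""
    return line_value.lower() in user_agent_header.lower()


if __name__ == "__main__":
    print(first_product_token(GOOGLEBOT_UA))
    print(matches_by_token("Googlebot", GOOGLEBOT_UA))
    print(matches_by_substring("Googlebot", GOOGLEBOT_UA))
```

With this assumed splitting rule the extracted token is "Mozilla", so a token-based comparison misses User-Agent: Googlebot, while a substring search over the whole header finds it.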
