Is the User-Agent line in robots.txt an exact match or a substring match?


Question

When a crawler reads the User-Agent line of a robots.txt file, does it try to match that line exactly against its own User-Agent, or does it treat the line as a substring of its User-Agent?

Nothing I have read answers this question explicitly. According to another StackOverflow thread, it is an exact match.

However, the RFC draft makes me believe that it is a substring match. For example, User-Agent: Google would match "Googlebot" and "Googlebot-News". Here is the relevant quotation from the RFC:

The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.

Additionally, the "Order of precedence for user-agents" section of Googlebot's documentation explains that the Google Images user agent "Googlebot-Image/1.0" matches User-Agent: googlebot.

I would appreciate any clarity here, and the answer may be more complicated than my question. For example, Eugene Kalinin's robots module for node mentions splitting the User-Agent to get the "name token" on line 29 and matching against that. If this is true, then Googlebot's User-Agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" will not match User-Agent: Googlebot.
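To make that concern concrete, here is a minimal Python sketch. It is not the code of the module mentioned above: the helper `name_token` and its split-on-slash rule are assumptions, chosen only to show what happens if a parser takes the leading product token of a full browser-style header and compares it exactly.

```python
def name_token(user_agent: str) -> str:
    """Hypothetical helper: treat the text before the first '/' as the name token."""
    return user_agent.split("/", 1)[0].strip()


full_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# The leading token of the full header is "Mozilla", so an exact
# comparison against "Googlebot" fails.
token = name_token(full_ua)
exact_match = token.lower() == "googlebot"

# A case-insensitive substring test against the whole header succeeds.
substring_match = "googlebot" in full_ua.lower()

# A crawler that identifies itself by a short product token behaves differently:
short_token = name_token("Googlebot-Image/1.0")
```

Under these assumptions the outcome depends on which string the crawler feeds into the comparison (the full HTTP header or a short product token), not only on exact versus substring matching.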

Recommended answer

The original robots.txt specification (from 1994) says:

User-agent

[…]

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

[…]

Whether bots and parsers actually comply with this, and which ones do, is another question that can't be answered in general.
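The recommendation quoted above can be sketched roughly as follows. This is one possible reading, not a reference implementation: the function names `strip_version` and `line_applies` are made up for illustration, and the sketch assumes the robots.txt value is the substring searched for inside the crawler's own name (the reading used in the question, where User-Agent: Google matches "Googlebot").

```python
def strip_version(name: str) -> str:
    """Drop version information, e.g. 'Googlebot/2.1' -> 'Googlebot'."""
    return name.split("/", 1)[0].strip()


def line_applies(line_value: str, robot_name: str) -> bool:
    """Case-insensitive substring match of a robots.txt User-Agent value
    against a crawler's name without its version.

    Assumption: the robots.txt value is the needle and the crawler's
    name is the haystack.
    """
    value = line_value.strip().lower()
    if not value:
        return False  # an empty value matches nothing in this sketch
    if value == "*":
        return True   # wildcard record applies to every robot
    return value in strip_version(robot_name).lower()


matches_google = line_applies("Google", "Googlebot/2.1")
matches_image = line_applies("googlebot", "Googlebot-Image/1.0")
matches_other = line_applies("Bingbot", "Googlebot/2.1")
```

A parser that does exact matching instead would replace the final `in` test with `==`, which is why the same robots.txt file can be interpreted differently by different crawlers.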
