robots.txt allow root only, disallow everything else?


Problem description

I can't seem to get this to work, but it seems really basic.

I want the domain root to be crawled:

http://www.example.com

But nothing else should be crawled; all subdirectories are dynamic:

http://www.example.com/*

I tried

User-agent: *
Allow: /
Disallow: /*/

but the Google Webmaster test tool says all subdirectories are allowed.

Anyone have a solution for this? Thanks :)

Solution

According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow and Disallow directives doesn't matter. So changing the order really won't help you.

Instead, use the $ operator to indicate the end of your path. $ means "end of the line" (i.e., don't match anything from this point on).

Test this robots.txt. I'm certain it should work for you (I've also verified in Google Search Console):

User-agent: *
Allow: /$
Disallow: /

This will allow http://www.example.com and http://www.example.com/ to be crawled, but everything else will be blocked.
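The longest-match precedence described above can be sketched in a small Python helper. This is a simplified illustration of Google's documented matching rules, not Google's actual parser; the pattern-to-regex conversion and the `allowed` helper are assumptions for demonstration only (note that the standard-library `urllib.robotparser` does not handle the `*` and `$` wildcards, which is why a custom sketch is used here):

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt pattern against a URL path.
    '*' matches any character sequence; a trailing '$' anchors the
    pattern to the end of the path. Unanchored patterns are prefixes."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    regex += "$" if anchored else ".*"
    return re.match(regex, path) is not None

def allowed(rules, path):
    """rules: list of ('allow' | 'disallow', pattern) tuples.
    The longest matching pattern wins; on a tie, Allow beats Disallow
    (Google's documented precedence). No matching rule means allowed."""
    best = None
    for kind, pattern in rules:
        if rule_matches(pattern, path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return best is None or best[1]

rules = [("allow", "/$"), ("disallow", "/")]
print(allowed(rules, "/"))            # True: only the bare root matches Allow: /$
print(allowed(rules, "/page"))        # False: Disallow: / is the only match
print(allowed(rules, "/index.html"))  # False: same caveat as the note below
```

Running this against the answer's robots.txt confirms the behavior: the root path is the only URL where the longer `Allow: /$` rule matches and outranks `Disallow: /`.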

Note: the Allow directive satisfies your particular use case, but if you have an index.html or default.php, those URLs will not be crawled.

Side note: I'm only really familiar with Googlebot and bingbot behavior. If you are targeting any other engines, they may or may not have specific rules about how the directives are listed. So if you want to be extra sure, you can always swap the positions of the Allow and Disallow directive blocks; I just set them this way to debunk some of the comments.

