Robots.txt, how to allow access only to domain root, and no deeper?

Question

I want to allow crawlers to access my domain's root directory (i.e. the index.html file), but nothing deeper (i.e. no subdirectories). I do not want to have to list and deny every subdirectory individually within the robots.txt file. Currently I have the following, but I think it is blocking everything, including stuff in the domain's root.

User-agent: *
Allow: /$
Disallow: /

How can I write my robots.txt to accomplish what I am trying for?

Thanks in advance!

Answer

There's nothing that will work for all crawlers. There are two options that might be useful to you.

Robots that allow wildcards should support something like:

Disallow: /*/

The major search engine crawlers understand the wildcards, but unfortunately most of the smaller ones don't.
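As a complete file, the wildcard approach would look something like the sketch below. Keep in mind it depends on non-standard wildcard support; a crawler that reads the pattern literally will only block paths that actually contain /*/, which blocks nothing useful:

User-agent: *
Disallow: /*/

Root-level files such as /index.html stay crawlable because no rule matches them, while anything one level deep (e.g. /subdir/page.html) matches the pattern and is blocked.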

If you have relatively few files in the root and you don't often add new files, you could use Allow to allow access to just those files, and then use Disallow: / to restrict everything else. That is:

User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /

The order here is important. Crawlers are supposed to take the first match. So if your first rule was Disallow: /, a properly behaving crawler wouldn't get to the following Allow lines.
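To see that evaluation order in action, you can sanity-check a rule set with Python's standard-library urllib.robotparser, which likewise evaluates rules in first-match order. This is just a quick sketch; example.com is a placeholder domain:

from urllib.robotparser import RobotFileParser

# The allow-list rules from above, fed in as raw robots.txt lines.
rules = [
    "User-agent: *",
    "Allow: /index.html",
    "Allow: /coolstuff.jpg",
    "Allow: /morecoolstuff.html",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# The root file hits an Allow rule before the catch-all Disallow.
print(rp.can_fetch("*", "https://example.com/index.html"))        # True
# Anything deeper falls through to Disallow: / and is blocked.
print(rp.can_fetch("*", "https://example.com/subdir/page.html"))  # False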

If a crawler doesn't support Allow, then it's going to see the Disallow: / and not crawl anything on your site. Provided, of course, that it ignores things in robots.txt that it doesn't understand.

All the major search engine crawlers support Allow, and a lot of the smaller ones do, too. It's easy to implement.
