robots.txt to disallow all pages except one? Do they override and cascade?


Question

I want one page of my site to be crawled and no others.

Also, in case it makes any difference to the answer above, I'd like to know the syntax for disallowing everything except the root (index) of the site.

# robots.txt for http://example.com/

User-agent: *
Disallow: /style-guide
Disallow: /splash
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc

Or can I do this?

# robots.txt for http://example.com/

User-agent: *
Disallow: /
Allow: /under-construction

Also I should mention that this is a WordPress install, so "under-construction," for example, is set to the front page. So in that case it acts as the index.

I think what I need is to have http://example.com crawled, but no other pages.

# robots.txt for http://example.com/

User-agent: *
Disallow: /*

Would this mean disallow anything after the root?

Answer

The easiest way to allow access to just one page would be:

User-agent: *
Allow: /under-construction
Disallow: /

The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.
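
As a quick sanity check, Python's standard-library robots.txt parser applies rules top to bottom and takes the first match, so it can confirm that the ordering above works. A minimal sketch, assuming the Allow-first rules from this answer and a placeholder host example.com:

from urllib.robotparser import RobotFileParser

# Rules from the answer above: Allow listed before Disallow.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /under-construction",
    "Disallow: /",
])

# The Allow rule matches first, so the one page stays crawlable;
# every other path falls through to Disallow: /.
print(rp.can_fetch("*", "http://example.com/under-construction"))  # True
print(rp.can_fetch("*", "http://example.com/some-other-page"))     # False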

The matching rules are simple: Disallow: / says "disallow anything that starts with a slash," which means everything on the site.

Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *, so they could assume it was okay to crawl /foo/bar.html.
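
Python's urllib.robotparser happens to be one of those parsers without wildcard support, so it can demonstrate the pitfall. A small sketch (example.com and the /foo paths are placeholders):

from urllib.robotparser import RobotFileParser

# A wildcard-unaware parser treats the '*' in the pattern literally.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /*",
])

print(rp.can_fetch("*", "http://example.com/foo"))   # True: /foo is NOT blocked
print(rp.can_fetch("*", "http://example.com/*foo"))  # False: literal '/*' prefix matches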

If you want just http://example.com itself crawled, but nothing else, you might try:

Allow: /$
Disallow: /

The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
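
One way to reason about how a wildcard-aware crawler evaluates these patterns is to translate them into regular expressions. This is only an illustrative sketch of the matching semantics described above; pattern_to_regex is a hypothetical helper, not part of any robots.txt library:

import re

def pattern_to_regex(pattern: str) -> "re.Pattern":
    # Model Googlebot-style matching: '*' matches any run of characters,
    # and a trailing '$' anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    translated = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + translated + ("$" if anchored else ""))

rule = pattern_to_regex("/$")
print(bool(rule.match("/")))       # True: only the bare root matches Allow: /$
print(bool(rule.match("/about")))  # False: everything else falls to Disallow: /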
