Block Google robots for URLs containing a certain word
Question
My client has a load of pages which they don't want indexed by Google - they are all called
http://example.com/page-xxx
so they are /page-123 or /page-2 or /page-25 etc.
Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?
Would something like this work?
Disallow: /page-*
Thanks
Accepted answer
In the first place, a line that says Disallow: /post-*
isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your Disallow line, rather than "post"?
Disallow says, in essence, "disallow URLs that start with this text". So your example line will disallow any URL that starts with "/post-". (That is, the file is in the root directory and its name starts with "post-".) The asterisk in this case is superfluous, as it's implied.
Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page-
will work. If they're scattered across directories in many different places, then things are a bit more difficult.
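For the simple root-directory case, the whole robots.txt would be just a couple of lines (the `User-agent: *` record applies the rule to all compliant crawlers, not just Googlebot):

```
User-agent: *
Disallow: /page-
```

Because Disallow rules are prefix matches, this covers /page-123, /page-2, /page-25 and so on, but not /blog/page-123.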
As @user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.
For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:
Disallow: /*page-
That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:
Disallow: */page-
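You can check the behaviour of the two wildcard patterns without waiting on a crawler. The sketch below (the helper name `robots_pattern_matches` is made up for illustration) implements just the Googlebot-style matching rule the answer relies on: patterns match from the start of the URL path, `*` matches any run of characters, and `$` anchors the end. It is not a full robots.txt parser; real crawlers also handle Allow rules and longest-match precedence.

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Test a URL path against a robots.txt rule using
    Googlebot-style wildcards: '*' = any character run,
    '$' = end of URL; matching is anchored at the path start."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.match(regex, path) is not None

# "/*page-" matches "page-" anywhere, even mid-filename:
print(robots_pattern_matches("/*page-", "/page-123"))                # True
print(robots_pattern_matches("/*page-", "/test/thispage-123.html"))  # True

# "*/page-" requires a "/" immediately before "page-",
# so it blocks /page-xxx in any directory but not "thispage-":
print(robots_pattern_matches("*/page-", "/test/thispage-123.html"))  # False
print(robots_pattern_matches("*/page-", "/dir/page-5"))              # True
```

Running this confirms the answer's reasoning: `/*page-` overblocks names like thispage-123.html, while `*/page-` only catches paths where a segment actually starts with "page-".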