Google is ignoring my robots.txt

Problem description

Here is the content of my robots.txt file:

User-agent: *
Disallow: /images/
Disallow: /upload/
Disallow: /admin/

As you can see, I explicitly disallowed all robots from indexing the folders images, upload, and admin. The problem is that one of my clients sent a request to remove content from the images folder, because a .pdf document from that folder appeared in the Google search results. Can anyone explain what I'm doing wrong here, and why Google indexed my folders?

Thanks!

Recommended answer

Quoting the Google Webmasters documentation:

If I block Google from crawling a page using a robots.txt disallow directive, will it disappear from search results?

Blocking Google from crawling a page is likely to decrease that page's ranking or cause it to drop out altogether over time. It may also reduce the amount of detail provided to users in the text below the search result. This is because without the page's content, the search engine has much less information to work with.


However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.
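Applied to the robots.txt above, that means dropping the Disallow line for the folder whose files should be de-indexed, so that Googlebot can actually crawl them and see the noindex signal. A minimal sketch (keeping the other two folders blocked is an assumption; adjust to your needs):

User-agent: *
Disallow: /upload/
Disallow: /admin/
# /images/ is intentionally no longer disallowed, so the PDFs can be
# crawled and the X-Robots-Tag header described below can be seen and obeyed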

Set an X-Robots-Tag header with noindex for all files in those folders. Set this header in your web server configuration for the folders. https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?hl=de
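For example, a minimal per-folder sketch for Apache (the /var/www/html/images path is a hypothetical assumption, and mod_headers must be enabled):

# Hypothetical path: apply the header to everything under the images folder
<Directory /var/www/html/images>
    Header set X-Robots-Tag "noindex, nofollow"
</Directory>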

1. Set the header in your Apache config for PDF files:

<Files ~ "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
</Files>

2. Disable directory indexing/listing of this folder (see the sketch after this list).

3. Add an empty index.html with a "noindex" robots meta tag:

<meta name="robots" content="noindex, nofollow" />
<meta name="googlebot" content="noindex" />

4. Force the removal of the already-indexed pages manually using Webmaster Tools.
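For step 2, a minimal sketch of disabling directory listings in Apache (again assuming the hypothetical /var/www/html/images path from above):

# Hypothetical path; -Indexes turns off mod_autoindex directory listings
<Directory /var/www/html/images>
    Options -Indexes
</Directory>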


Question from the comments: How do I forbid all files in the folder?

# 1) Deny folder access completely
<Directory /var/www/denied_directory>
    Order allow,deny
    Deny from all
</Directory>

# 2) Inside the folder, place a .htaccess denying access to everything except index.html
Order allow,deny
Deny from all
<FilesMatch index\.html>
    Allow from all
</FilesMatch>

# 3) Allow the directory, but deny access for a specific environment match
BrowserMatch "Googlebot" go_away_badbot
BrowserMatch ^BadRobot/0.9 go_away_badbot

<Directory /deny_access_for_badbot>
    Order allow,deny
    Allow from all
    Deny from env=go_away_badbot
</Directory>

# 4) Or redirect bots to the main page, sending HTTP status 301
BrowserMatch Googlebot badbot=1
RewriteEngine on
RewriteCond %{ENV:badbot} =1
RewriteRule ^/$ /main/ [R=301,L]
