driver.page_source returns only meta name="ROBOTS" content="NOINDEX, NOFOLLOW" using Selenium


Problem description

I want to scrape a website and get the page content with this code:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Connect to a remote Selenium hub and request a Chrome session
driver = webdriver.Remote("http://adress:4444/wd/hub", DesiredCapabilities.CHROME)
link = 'website_url'
driver.get(link)
# Read the HTML the browser currently holds for the page
s = driver.page_source
print(s.encode("utf-8"))
driver.quit()

This is what I receive:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">

I also tried a lot of different approaches (Luminati, the newipnow proxy, phantomjs), but none of them worked. Any suggestions on what else I can try to solve this?

Solution

<meta name="ROBOTS" content="value">

This meta tag tells search engines which actions they are and are not allowed to take on a given page. It can be placed anywhere between the <head> and </head> tags.

Note: Because this <meta> tag has no site-wide effect, it can contain different values on different pages of the same website.

The valid values are:

  • Index (default value)
  • Noindex
  • None
  • Follow
  • Nofollow
  • Noarchive
  • Nosnippet

These values can also be combined, separated by commas, to form the desired meta robots tag.

Examples:

  • <meta name="robots" content="noindex" />
  • <meta name="robots" content="index,follow" />
  • <meta name="robots" content="index,follow,noarchive" />
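The way values combine into a single tag can be sketched in a few lines of Python. This is only an illustration: the helper name `build_robots_meta` and the `VALID_DIRECTIVES` set are hypothetical, and the set holds just the values listed above rather than every directive a search engine may recognise.

```python
# Directive names copied from the list of valid values above (assumed
# sufficient for this illustration, not an exhaustive list).
VALID_DIRECTIVES = {
    "index", "noindex", "none", "follow", "nofollow", "noarchive", "nosnippet",
}


def build_robots_meta(*directives):
    """Return a <meta name="robots"> tag that combines the given directives."""
    normalized = [d.strip().lower() for d in directives]
    if not normalized:
        raise ValueError("at least one directive is required")
    unknown = [d for d in normalized if d not in VALID_DIRECTIVES]
    if unknown:
        raise ValueError("unknown directive(s): " + ", ".join(unknown))
    content = ",".join(normalized)
    return '<meta name="robots" content="' + content + '" />'


if __name__ == "__main__":
    print(build_robots_meta("index", "follow", "noarchive"))
```

Calling `build_robots_meta("index", "follow", "noarchive")` reproduces the third example in the list, while an unrecognised value raises a `ValueError` instead of producing a tag.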

content="NOINDEX, NOFOLLOW"

The NOINDEX value tells search engines NOT to index the page, so the page should not show up in search results. The NOFOLLOW value tells search engines NOT to follow or discover the pages that are LINKED TO on this page.
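As a rough illustration of how a crawler might read these two directives, the sketch below extracts every robots meta tag from an HTML string using only the standard library. The names `RobotsMetaParser` and `robots_policy` are hypothetical, and treating `none` as shorthand for `noindex, nofollow` is an assumption based on how major search engines document that value.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        # The name attribute is matched case-insensitively ("ROBOTS" == "robots").
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_policy(html):
    """Report whether the page permits indexing and link following."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    found = parser.directives
    # Assumption: "none" is treated as shorthand for "noindex, nofollow".
    return {
        "may_index": not ({"noindex", "none"} & found),
        "may_follow": not ({"nofollow", "none"} & found),
    }


if __name__ == "__main__":
    print(robots_policy('<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">'))
```

For the tag returned in the question, both flags come back `False`; a page with no robots meta tag at all is treated as allowing both, matching the default `index` behaviour.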

Web developers add the NOINDEX, NOFOLLOW meta robots tag to development websites so that search engines don't accidentally start sending traffic to a website that is still under construction.


Why are you seeing this?

The reason can be any of the following:

  • You are trying to execute your automated tests within the development environment.
  • The development team has accidentally added this tag to the live website.
  • The development team forgot to remove it from the live website after going live.
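Before deciding which of these cases applies, it can help to confirm that the returned source really contains nothing beyond the robots tag. The following is a small diagnostic sketch, not a fix: `has_visible_text` is a hypothetical helper that reports whether an HTML string holds any text outside `<script>`, `<style>` and `<title>` elements.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Gather text nodes that appear outside <script>, <style> and <title>."""

    SKIPPED_TAGS = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def has_visible_text(page_source):
    """Return True if the HTML contains any text a reader could see."""
    collector = VisibleTextCollector()
    collector.feed(page_source)
    collector.close()
    return bool(collector.chunks)


if __name__ == "__main__":
    robots_only = (
        '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW">'
        "</head><body></body></html>"
    )
    print(has_visible_text(robots_only))
```

In the script from the question this would be called as `has_visible_text(driver.page_source)`; a `False` result confirms that the browser received a page with no readable content, only markup such as the robots tag.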

Reference

What is the meaning of the meta name "robots" tag


Outro

Using the robots meta tag
