driver.page_source returns only meta name="ROBOTS" content="NOINDEX, NOFOLLOW" using Selenium
Problem description
I want to scrape a website and get the page content with this code:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
driver = webdriver.Remote("http://adress:4444/wd/hub", DesiredCapabilities.CHROME)
link = 'website_url'
driver.get(link)
s = driver.page_source
print((s.encode("utf-8")))
driver.quit()
This is what I receive:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
I have also tried a lot of different approaches (Luminati, the newipnow proxy, PhantomJS), but none of them work. Any suggestions on what else I can try to solve this?
<meta name="ROBOTS" content="value">
This meta tag tells search engines which actions they are and are not allowed to take on a given page. It can be placed anywhere between the <head> and </head> tags.
Note: Because this <meta> tag does not have a site-wide effect, it can contain different values on different pages of the same website.
The valid values are:
- Index (default value)
- Noindex
- None
- Follow
- Nofollow
- Noarchive
- Nosnippet
These values can also be combined, separated by commas, to form the desired meta robots tag.
Example:
<meta name="robots" content="noindex" />
<meta name="robots" content="index,follow" />
<meta name="robots" content="index,follow,noarchive" />
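As a minimal sketch of how a combined content value breaks down into individual directives, the snippet below splits on commas and normalizes case and whitespace. The function name parse_robots_content is hypothetical and only for illustration; it is not part of Selenium or any other library.

```python
def parse_robots_content(content):
    """Split a robots content value into a set of lowercase directives.

    Hypothetical helper for illustration: it only tokenizes the string and
    does not check the directives against the list of valid values.
    """
    return {part.strip().lower() for part in content.split(",") if part.strip()}


if __name__ == "__main__":
    # The value from the question, written in upper case with a space.
    print(sorted(parse_robots_content("NOINDEX, NOFOLLOW")))
    # One of the combined examples listed above.
    print(sorted(parse_robots_content("index,follow,noarchive")))
```

Normalizing to lowercase means that "NOINDEX, NOFOLLOW" and "noindex,nofollow" compare as the same set of directives.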
content="NOINDEX, NOFOLLOW"
The NOINDEX value tells search engines NOT to index the page, so the page should not show up in search results. The NOFOLLOW value tells search engines NOT to follow or discover the pages that are linked to on this page.
Web developers add the NOINDEX, NOFOLLOW meta robots tag to development websites so that search engines do not accidentally start sending traffic to a website that is still under construction.
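To check whether a page source string (for example the value of driver.page_source) carries this tag, something like the following sketch could be used. It relies only on the standard-library html.parser module; the class name RobotsMetaFinder and the function find_robots_directives are hypothetical names chosen for this example, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def find_robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


if __name__ == "__main__":
    # Made-up sample standing in for driver.page_source.
    sample = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head><body></body></html>'
    found = find_robots_directives(sample)
    print("noindex" in found, "nofollow" in found)
```

An empty result means no robots meta tag was found in the string. A non-empty result only reports which directives the page declares; it does not say why the server sent them.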
Why are you seeing this?
The reason can be any of the following:
- You are trying to execute your automated tests within the development environment.
- The development team accidentally added this tag to the live website.
- The development team forgot to remove it from the live website after going live.
Reference
What is the meaning of the meta name "robots" tag
Outro
Using the robots meta tag