Driver doesn't return proper page source
Question
I'm trying to load a web page, scroll to the very bottom of it (the page uses infinite scroll), and then get the page source.
Scrolling and loading seem to work correctly, but driver.page_source returns very short HTML, which is only a small part of the whole page source.
import time
from selenium import webdriver

def scroll_to_the_bottom(driver):
    # Keep scrolling until the page source stops changing between passes.
    old_html = ''
    new_html = driver.page_source
    while old_html != new_html:
        print('SCROLL')
        old_html = driver.page_source
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # give the infinite scroll time to load more content
        new_html = driver.page_source

driver = webdriver.Chrome()  # the question uses chromedriver
driver.get("http://www.citypaper.com/music/short-list/bcpnews-the-short-list-weird-al-the-heartless-bastards-chastity-belt-more-20150609-story.html")
scroll_to_the_bottom(driver)
print(driver.page_source)
Console:
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" data-role="base navhead resizescroll imgsize metrics oopadloader socialshare panelmod transporter"><head><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script><script async="" src="//www.google-analytics.com/analytics.js"></script><script type="text/javascript" src="http://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script charset="UTF-8" type="text/javascript" src="http://cdn.taboola.com/libtrc/impl.174-RELEASE.js"></script><script async="" src="//widget.perfectmarket.com/tribunedigital-network/load.js"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script>
<title>Music Boxes - Baltimore City Paper</title>
<link rel="dns-prefetch" href="//www.trbimg.com" /><link rel="dns-prefetch" href="//static.chartbeat.com" /><link rel="dns-prefetch" href="//loggingservices.tribune.com" /><link rel="dns-prefetch" href="//m.trb.com" /><link rel="dns-prefetch" href="//b.scorecardresearch.com" /><link rel="dns-prefetch" href="//www.google-analytics.com" /><link rel="dns-prefetch" href="http://pubads.g.doubleclick.net" /><link rel="dns-prefetch" href="https://securepubads.g.doubleclick.net" /><link rel="dns-prefetch" href="//secure-us.imrworldwide.com" /><link rel="dns-prefetch" href="//www.googletagservices.com" /><link rel="dns-prefetch" href="http://ssor.tribdss.com" /><link rel="dns-prefetch" href="//cdn.krxd.net" /><link rel="dns-prefetch" href="//cdn.gigya.com" /><link rel="dns-prefetch" href="//cdn.taboola.com" /><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" />
<meta charset="utf-8" />
<meta name="x-servername" content="i10latisrapp02" />
<meta name="robots" content="noodp, noydir" />
I use chromedriver, so I can clearly see that it scrolls to the bottom. Where could the problem be?
I've added the web page being scraped.
Recommended answer
You cannot rely on page_source to get the current state of the page. The Python docs do not point this out, but if you look at the Selenium Java docs for getPageSource, you'll see:
If the page has been modified after loading (for example, by Javascript) there is no guarantee that the returned text is that of the modified page.
What you can do is ask the browser to serialize the DOM. This produces HTML that represents the DOM at the time you make the call:
driver.execute_script("return document.documentElement.outerHTML")
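To show how this could fit into the scrolling loop from the question, here is a minimal sketch. It is not the asker's code: the helper name get_rendered_html, the pause and max_rounds parameters, and the choice to compare document.body.scrollHeight between passes (rather than comparing page sources) are all assumptions made for illustration. The functions only call driver.execute_script, so they should work with any Selenium WebDriver instance.

```python
import time


def get_rendered_html(driver):
    """Return the live DOM as serialized by the browser (not driver.page_source)."""
    return driver.execute_script("return document.documentElement.outerHTML")


def scroll_to_the_bottom(driver, pause=3.0, max_rounds=50):
    """Scroll until the page height stops growing, then return the serialized DOM.

    pause and max_rounds are hypothetical knobs: pause is the wait (in seconds)
    for new content after each scroll, max_rounds caps the number of scrolls so
    a truly endless feed cannot loop forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume we reached the bottom
        last_height = new_height
    return get_rendered_html(driver)
```

With a real browser, this would be used roughly as: create the driver with webdriver.Chrome(), call driver.get(url), then html = scroll_to_the_bottom(driver). Comparing heights instead of full page sources is only one possible stopping condition; a page that loads content without growing taller would need a different check.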