Driver doesn't return proper page source


Question

I'm trying to load a web page, scroll to the very bottom of it (the page uses infinite scroll), and then get the page source.

Scrolling and loading seem to work correctly, but driver.page_source returns a very short HTML string that is only a small part of the whole page source.

import time

from selenium import webdriver


def scroll_to_the_bottom(driver):
    # Scroll repeatedly until the page source stops changing.
    old_html = ''
    new_html = driver.page_source
    while old_html != new_html:
        print('SCROLL')
        old_html = driver.page_source
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # wait for the infinite scroll to load more content
        new_html = driver.page_source


driver = webdriver.Chrome()  # the question states that chromedriver is used
driver.get("http://www.citypaper.com/music/short-list/bcpnews-the-short-list-weird-al-the-heartless-bastards-chastity-belt-more-20150609-story.html")
scroll_to_the_bottom(driver)
print(driver.page_source)

Console output:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" data-role="base navhead resizescroll imgsize metrics oopadloader socialshare panelmod transporter"><head><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script><script async="" src="//www.google-analytics.com/analytics.js"></script><script type="text/javascript" src="http://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script charset="UTF-8" type="text/javascript" src="http://cdn.taboola.com/libtrc/impl.174-RELEASE.js"></script><script async="" src="//widget.perfectmarket.com/tribunedigital-network/load.js"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script>
<title>Music Boxes - Baltimore City Paper</title>

      <link rel="dns-prefetch" href="//www.trbimg.com" /><link rel="dns-prefetch" href="//static.chartbeat.com" /><link rel="dns-prefetch" href="//loggingservices.tribune.com" /><link rel="dns-prefetch" href="//m.trb.com" /><link rel="dns-prefetch" href="//b.scorecardresearch.com" /><link rel="dns-prefetch" href="//www.google-analytics.com" /><link rel="dns-prefetch" href="http://pubads.g.doubleclick.net" /><link rel="dns-prefetch" href="https://securepubads.g.doubleclick.net" /><link rel="dns-prefetch" href="//secure-us.imrworldwide.com" /><link rel="dns-prefetch" href="//www.googletagservices.com" /><link rel="dns-prefetch" href="http://ssor.tribdss.com" /><link rel="dns-prefetch" href="//cdn.krxd.net" /><link rel="dns-prefetch" href="//cdn.gigya.com" /><link rel="dns-prefetch" href="//cdn.taboola.com" /><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" />
    <meta charset="utf-8" />
    <meta name="x-servername" content="i10latisrapp02" />

      <meta name="robots" content="noodp, noydir" />

I use chromedriver, so I can clearly see that it scrolls to the bottom. Where could the problem be?

I've added the URL of the web page being scraped.

Recommended answer

You cannot rely on page_source to get the current state of the page. The Python docs do not point this out, but the Selenium Java docs for getPageSource say:

If the page has been modified after loading (for example, by Javascript) there is no guarantee that the returned text is that of the modified page.

What you can do instead is ask the browser to serialize the DOM. This produces HTML that represents the DOM at the time you make the call:

driver.execute_script("return document.documentElement.outerHTML")
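Putting the two pieces together, a minimal sketch of the scroll loop might look like the following. It compares document.body.scrollHeight between rounds rather than comparing page_source, then serializes the DOM once scrolling stops. The function name scroll_and_serialize and the pause and max_rounds parameters are assumptions made for illustration and do not come from the question or from Selenium. Note also that outerHTML of the root element does not include the doctype declaration.

```python
import time


def scroll_and_serialize(driver, pause=3.0, max_rounds=50):
    """Scroll until the page height stops growing, then return the live DOM as HTML.

    `driver` is any object exposing Selenium's execute_script(script) method.
    `pause` and `max_rounds` are illustrative defaults (assumptions), not tuned values.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the infinite scroll time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page did not grow, so assume the bottom was reached
        last_height = new_height
    # Serialize the DOM as it is now instead of relying on page_source.
    return driver.execute_script("return document.documentElement.outerHTML")


# Possible usage with a real browser (requires selenium and chromedriver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(url)  # url: the page to scrape
#   html = scroll_and_serialize(driver)
```

The max_rounds cap is there so the loop cannot run forever on a page that keeps loading content. A fixed sleep is the simplest way to wait and mirrors the original code; a page that loads slowly may need a longer pause.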
