Scraping JavaScript with Python and Selenium WebDriver


Question

I'm trying to scrape the ads from Ask, which are generated in an iframe by JavaScript hosted by Google.

When I navigate through manually and view the source, there they are (I'm specifically looking for a div with the id "adBlock", which is inside an iframe).

But when I try using Firefox, Chromedriver or FirefoxPortable, the source returned to me is missing all of the elements I'm looking for.

I tried scraping with urllib2 and got the same results, even after adding the necessary headers. I thought for sure that a real browser instance, like the one WebDriver creates, would fix that problem.

Here's the code I'm working from, which had to be cobbled together from a few different sources:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint

# Create a new instance of the Chrome driver (raw string keeps the backslashes literal)
driver = webdriver.Chrome(r'C:\Python27\Chromedriver\chromedriver.exe')
driver.get("http://www.ask.com")

print(driver.title)
inputElement = driver.find_element_by_name("q")

# type in the search
inputElement.send_keys("baseball hats")
# submit the search form
inputElement.submit()

try:
    # wait up to 10 seconds for the results page title to appear
    WebDriverWait(driver, 10).until(EC.title_contains("baseball"))
    print(driver.title)
    output = driver.page_source
    print(output)
finally:
    driver.quit()

I know I cycle through a few different attempts at viewing the source; that's not what I'm concerned about.

Any thoughts as to why I'm getting one result from this script (ads omitted) and a totally different result (ads present) from the browser it opened? I've tried Scrapy, Selenium, urllib2, etc. No joy.

Recommended answer

Selenium only returns the contents of the current frame or iframe. You'll have to switch into the iframes using something along these lines:

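# collect every iframe in the top-level document
iframes = driver.find_elements_by_tag_name("iframe")

for iframe in iframes:
    # go back to the top-level document before entering the next iframe
    driver.switch_to_default_content()
    driver.switch_to_frame(iframe)

    output = driver.page_source
    print(output)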
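
The loop above can also be wrapped in a small helper. The following is only a sketch, not part of the original answer: it uses the newer spellings `driver.switch_to.frame(...)` and `driver.find_elements(...)`, and the function names `collect_frame_sources` and `sources_containing` are hypothetical. Checking for the literal text `id="adBlock"` is an assumption about how the browser serializes the div from the question.

```python
# The locator strategy string; it is the same value as
# selenium.webdriver.common.by.By.TAG_NAME.
TAG_NAME = "tag name"


def collect_frame_sources(driver):
    """Return the page source of each top-level iframe, in document order."""
    sources = []
    frame_count = len(driver.find_elements(TAG_NAME, "iframe"))
    for index in range(frame_count):
        # Return to the top-level document and look the iframes up again,
        # so a stale element reference is never passed to switch_to.frame().
        driver.switch_to.default_content()
        frames = driver.find_elements(TAG_NAME, "iframe")
        if index >= len(frames):
            break  # the page dropped an iframe while we were iterating
        driver.switch_to.frame(frames[index])
        sources.append(driver.page_source)
    # Leave the driver focused on the top-level document again.
    driver.switch_to.default_content()
    return sources


def sources_containing(sources, marker='id="adBlock"'):
    """Keep only the frame sources that contain the given marker text."""
    return [source for source in sources if marker in source]
```

With a live driver, something like `sources_containing(collect_frame_sources(driver))` could be called after the wait in the question's script. This only visits top-level iframes; if the ad markup sits in a nested iframe, the same switch would have to be repeated one level down.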
