Scraping dynamic content through Selenium?

Question

I'm trying to scrape dynamic content from a blog through Selenium, but it always returns unrendered JavaScript.

To test this behavior, I waited until the iframe had loaded completely and printed its content, which prints fine. But when I move back to the parent frame, it again just displays unrendered JavaScript.

I'm looking for a way to print the completely rendered HTML content.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome("path to chrome driver")   
driver.get('http://justgivemechocolateandnobodygetshurt.blogspot.com/')

WebDriverWait(driver, 40).until(expected_conditions.frame_to_be_available_and_switch_to_it((By.ID, "navbar-iframe")))

# Inside the iframe, the rendered iframe HTML is printed.
content = driver.page_source
print(content)

# After switching back to the parent frame, unrendered JavaScript is printed again.
driver.switch_to.parent_frame()
content = driver.page_source
print(content)

Recommended answer

The problem is that .page_source works only in the current context (there is that "current top-level browsing context" notion). That means that if you call it on the default content, you will not get the inner HTML of the child iframe elements; for that, you have to switch into the context of a frame and call .page_source there.

In other words, to get the complete HTML of the page, including the page source of the iframes, you have to switch into the iframe contexts one by one and get the sources separately.
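The "one by one" approach could be sketched roughly as follows. This is only an illustration, not code from the original answer: the helper name collect_page_sources is made up here, it handles only iframes directly inside the top-level document (no nested iframes), and it assumes the page does not re-render while the loop runs. The locator string "tag name" is the value of Selenium's By.TAG_NAME, written out so that the helper works with any driver-like object.

```python
def collect_page_sources(driver):
    """Return the top-level page source followed by each direct iframe's source.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver
    (or an object exposing the same switch_to / page_source / find_elements API).
    """
    # Start from the top-level document so the result order is predictable.
    driver.switch_to.default_content()
    sources = [driver.page_source]

    # "tag name" is the string value of selenium.webdriver.common.by.By.TAG_NAME.
    frames = driver.find_elements("tag name", "iframe")

    for frame in frames:
        # page_source only reflects the current context, so enter the frame first.
        driver.switch_to.frame(frame)
        sources.append(driver.page_source)
        # Go back to the top-level document before entering the next frame.
        driver.switch_to.default_content()

    return sources
```

With the driver from the question, calling collect_page_sources(driver) after the page has loaded would give a list whose first item is the parent document and whose remaining items are the iframe documents, which can then be printed or parsed separately.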

Old answer:

I would wait for at least one blog entry to be loaded before getting the page_source:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 40)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".entry-content")))

print(driver.page_source)
