Request fully JavaScript-rendered HTML source from a website and find all iframe tags
Question
I am trying to use Selenium and BeautifulSoup to retrieve all iframe tags from a website. The problem is that I am not getting all of the iframes: the page contains inner HTML documents that BS4 does not search, and I don't believe the JavaScript is being executed, so some HTML elements may never be rendered. Is there a web scraping tool that lets me request a URL, retrieve the fully JS-rendered HTML, then search the DOM and get every tag matching iframe, including those in the inner HTML documents?
Basically, I can see all the tags I want in the Chrome inspector, but they do not show up in the list returned by BS4's find_all('iframe').
Here is my code:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
browser = webdriver.Chrome('C:/Users/G/chromedriver.exe')
browser.get("https://reddit.com")
HTML = browser.page_source
innerHTML = browser.execute_script("return document.body.innerHTML")
page = BeautifulSoup(innerHTML, 'html.parser')
for iframe in page.find_all('iframe'):
    print(iframe)
browser.close()
Recommended answer
You can get all the <iframe> tags exclusively through Selenium with the following code block:
from selenium import webdriver
browser = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
browser.get("https://reddit.com")
frames_tag = browser.find_elements_by_tag_name("iframe")
frames_xpath = browser.find_elements_by_xpath("//iframe")
frames_css = browser.find_elements_by_css_selector("iframe")
print("Frames detected through iframe tag are %s" %frames_tag)
print("Frames detected through xpath are %s" %frames_xpath)
print("Frames detected through css are %s" %frames_css)
browser.quit()
The output on my console is:
Frames detected through iframe tag are [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ead39d06-0e39-4b40-9425-a86a1fe88d4f")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="1ce10f29-a620-4ce6-90e1-9da563046c70")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ba8493e4-8246-47a0-9ed4-3f51b8c0f133")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="19c0f134-c243-47bd-96d1-6b06ff66a011")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="95d78fa6-fb4f-4b7c-89c5-9b85965f0e4c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="e6d2d931-1f35-432f-8825-052e244fe798")>]
Frames detected through xpath are [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ead39d06-0e39-4b40-9425-a86a1fe88d4f")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="1ce10f29-a620-4ce6-90e1-9da563046c70")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ba8493e4-8246-47a0-9ed4-3f51b8c0f133")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="19c0f134-c243-47bd-96d1-6b06ff66a011")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="95d78fa6-fb4f-4b7c-89c5-9b85965f0e4c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="e6d2d931-1f35-432f-8825-052e244fe798")>]
Frames detected through css are [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ead39d06-0e39-4b40-9425-a86a1fe88d4f")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="1ce10f29-a620-4ce6-90e1-9da563046c70")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="ba8493e4-8246-47a0-9ed4-3f51b8c0f133")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="19c0f134-c243-47bd-96d1-6b06ff66a011")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="95d78fa6-fb4f-4b7c-89c5-9b85965f0e4c")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="98594106-54a6-4941-a6ab-cd9d92e9afa2", element="e6d2d931-1f35-432f-8825-052e244fe798")>]
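The lists above hold WebElement handles for the iframes in the top-level document only; Selenium does not look inside a frame until the driver is switched into it. One way to reach iframes nested inside other iframes is a recursive walk. The following is a minimal sketch, assuming a Selenium-style driver that offers `find_elements`, `switch_to.frame` and `switch_to.parent_frame`. The helper name `walk_iframes`, the `max_depth` cap and the literal `"tag name"` (the string value of `By.TAG_NAME`) are illustrative choices, not part of the original answer.

```python
# "tag name" is the string value of selenium's By.TAG_NAME locator strategy;
# using the literal keeps this helper importable without selenium installed.
TAG_NAME = "tag name"


def walk_iframes(driver, depth=0, max_depth=5):
    """Return (depth, src) pairs for every iframe reachable from the current frame.

    `driver` is assumed to behave like a selenium WebDriver. `max_depth` is a
    hypothetical safety cap so that deeply nested pages cannot recurse forever.
    """
    found = []
    if depth > max_depth:
        return found
    for frame in driver.find_elements(TAG_NAME, "iframe"):
        # Record the frame itself before descending into it.
        found.append((depth, frame.get_attribute("src")))
        # Make the frame's document the current search context.
        driver.switch_to.frame(frame)
        try:
            found.extend(walk_iframes(driver, depth + 1, max_depth))
        finally:
            # Always step back out so the next sibling is looked up
            # in the document it belongs to.
            driver.switch_to.parent_frame()
    return found
```

With the session from the answer, this could be called as `walk_iframes(browser)` right after `browser.get(...)`. Each pair gives the nesting depth and the frame's `src` attribute, which may be empty or `None` for frames whose content is written by script.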