Save complete web page (incl css, images) using python/selenium

Views: 56 | Published: 2021/12/17 13:59:30 | Tags: python selenium web-scraping web-crawler bioinformatics

Problem description

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:

import time
from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)

At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.

content = driver.page_source
with open('webpage.html', 'w') as f:
    f.write(content)

I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box and does not provide a way to click it (as two commenters point out).

Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?

UPDATE - Follow-up question for James' answer

I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See attached screenshot (manually saved page on the left, script-saved page with the extra formatting text on the right).

This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James' script:

<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST. 
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>

Any idea why this is happening?

Solution

As you noted, Selenium cannot interact with the browser's context menu to use Save as.... Instead, you can drive the dialog with an external automation library such as pyautogui.

pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')

This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.
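Typing the full sequence as the file name works, but long sequences can produce unwieldy names. As one possible alternative, the sketch below derives a short, stable name from a hash of the sequence. The helper name `short_filename` and the `blast_` prefix are hypothetical choices for illustration, not part of the original answer.

```python
import hashlib


def short_filename(sequence, prefix="blast", ext=".html"):
    """Build a short file name that is stable for a given sequence.

    Hypothetical helper: the prefix and the 12-character digest length
    are arbitrary choices, not something the original answer prescribes.
    """
    # Hash the sequence so that equal sequences map to equal names.
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()[:12]
    return f"{prefix}_{digest}{ext}"


if __name__ == "__main__":
    print(short_filename("CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTA"))
```

The result could then be passed to `pyautogui.typewrite(...)` in place of `SEQUENCE + '.html'`. If you need to map a file back to its sequence, you would have to record that mapping yourself.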

Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
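As a rough sketch of what that modification might look like, the helper below picks the save shortcut from `sys.platform`. It assumes the browser uses Command+S on macOS and Ctrl+S elsewhere, which is the usual default but has only been verified on Ubuntu in this answer. The names `save_hotkey` and `press_save` are hypothetical.

```python
import sys


def save_hotkey(platform=None):
    """Return the key names for the browser's 'Save as...' shortcut.

    Assumption: Command+S on macOS, Ctrl+S on other platforms.
    """
    platform = sys.platform if platform is None else platform
    if platform == "darwin":
        return ("command", "s")
    return ("ctrl", "s")


def press_save():
    """Send the save shortcut to whichever window currently has focus."""
    # Imported lazily because pyautogui needs a desktop session to load.
    import pyautogui

    pyautogui.hotkey(*save_hotkey())
```

Calling `press_save()` would replace the hard-coded `pyautogui.hotkey('ctrl', 's')` line. The browser window still has to be in focus for the keystrokes to reach it.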

Full code, in which I also added conditional waits to improve speed:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()

# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
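One thing the script above does not do is confirm that the save actually finished; the final key press returns immediately. If the script goes on to close the browser or start another search, it may help to poll for the saved file first. The following is a minimal sketch using only the standard library. The function name `wait_for_file` and the timeout values are assumptions, and you would need to supply the real path your browser saves to.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Return True once `path` exists, is non-empty, and has stopped growing.

    Returns False if that does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            # Treat the file as complete when two consecutive polls
            # see the same non-zero size.
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False
```

For example, assuming the browser saves into `~/Downloads` (an assumption; check your browser settings), you might call `wait_for_file(os.path.expanduser('~/Downloads/' + SEQUENCE + '.html'))` after the last pyautogui line. This only checks the HTML file, not the accompanying assets folder.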
