Save complete web page (incl. CSS, images) using Python/Selenium

Question

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:

import time
from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)

At that point I have a page on which I can manually click "Save as" and get a local file (with a corresponding folder of image/JS assets) that lets me view the whole returned page locally (minus content that is generated dynamically when scrolling down the page, which is fine). I assumed there would be a simple way to mimic this "Save as" function in Python/Selenium but haven't found one. The code below only saves the HTML, and does not leave me with a local file that looks the way it does in the web browser, with images, etc.

content = driver.page_source
with open('webpage.html', 'w', encoding='utf-8') as f:  # explicit encoding so non-ASCII page content is written reliably
    f.write(content)

I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box, and does not provide a way to click it (as two commenters point out)

Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?

UPDATE: follow-up question for James' answer

I ran James' code to generate page.html (and its associated files) and compared it to the HTML file I got from manually clicking "Save as". The page.html saved by James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See the attached screenshot (manually saved page on the left, script-saved page with the extra formatting text on the right).

This is especially surprising to me because the raw HTML of the page saved by James' script seems to indicate those fields should still be hidden. See, for example, the HTML below, which appears the same in both files, yet the text at issue is only rendered when the browser opens the page saved by James' script:

<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST. 
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>

Any idea why this is happening?

Recommended answer

As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so you could instead use an external automation library like pyautogui.

pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')

This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.
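
As one possible way to change the file name, the sketch below derives a shorter name from a hash of the sequence instead of typing the whole sequence into the dialog. The helper `filename_for` and the `blast_` prefix are hypothetical names chosen for illustration; they are not part of the original answer.

```python
import hashlib


def filename_for(sequence, prefix="blast_", digest_chars=12):
    """Build a short, repeatable file name for a query sequence.

    Hypothetical helper: the same sequence always maps to the same name,
    and different sequences are very unlikely to collide.
    """
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()[:digest_chars]
    return f"{prefix}{digest}.html"
```

The result could then be typed into the dialog with `pyautogui.typewrite(filename_for(SEQUENCE))` in place of `SEQUENCE + '.html'`.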

Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
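
A minimal sketch of how the key combination might be chosen per operating system, assuming the browser uses Command+S on macOS and Ctrl+S on Linux and Windows. The helper name `save_hotkey` is hypothetical, and the macOS and Windows cases were not tested in the answer above.

```python
import sys


def save_hotkey(platform=sys.platform):
    """Return the pyautogui key names for the browser's 'Save as...' shortcut."""
    # Assumption: macOS browsers bind the dialog to Command+S,
    # while Linux and Windows browsers use Ctrl+S.
    if platform == "darwin":
        return ("command", "s")
    return ("ctrl", "s")
```

With pyautogui imported, the shortcut would be sent as `pyautogui.hotkey(*save_hotkey())`.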

Full code, in which I also added conditional waits to improve speed:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)

# enter sequence into the query field and hit 'blast' button to search
# find_element(By.ID, ...) also works on Selenium 4, where find_element_by_id was removed
seq_query_field = driver.find_element(By.ID, "seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element(By.ID, "b1")
blast_button.click()

# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
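
The script above returns as soon as Enter is pressed, while the browser may still be writing the page to disk. If the script goes on to close the browser or submit another sequence, it may help to wait for the saved file first. The sketch below is one way to do that with the standard library only; `wait_for_file` is a hypothetical helper, and the location of the saved file depends on the browser's download settings.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Return True once `path` exists and its size has stopped changing.

    Returns False if that has not happened within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            # Two consecutive polls with the same non-zero size are taken
            # as a sign that the browser has finished writing the file.
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False
```

For example, `wait_for_file(os.path.expanduser('~/Downloads/' + SEQUENCE + '.html'))` would apply if the browser saves into `~/Downloads`, which is an assumption about the local setup.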
