How can I automate saving webpages?

Question

I need to archive several hundred webpages in the style of what browsers call "Save as, complete", meaning they save an HTML file for the page itself along with a folder full of other files needed to render the page correctly, such as CSS, JavaScript, and image files. This allows the pages to be viewed offline looking the same as when displayed online.

Here are the methods I've tried and the problems with each:

  • Manual process in Firefox:
      • On the link for the next page, right-click. Type "A" for "Copy Link Location" to copy the destination URL to the clipboard.
      • Click the link to go to the page.
      • Type "Alt-F-A" for "Save page as". If not already selected, set "Save as type" to "Web page, complete".
      • If not already there, put the cursor in "File name". Type "Ctrl-V" (or "Shift-Insert") to paste the clipboard, which contains the URL of the current page.
      • Move the cursor to the end of the URL, then move it back until it reaches the last "/". Select from there to the left to select the path part of the URL.
      • Press "Delete" to remove the path from the URL, leaving just the filename.
      • Press "Enter" on the keyboard or click "OK" in the dialog box.
      • The page is now saved. Repeat the process by clicking the link for the next page. (This assumes each page has a "Next" link, which is true for the pages I'm archiving. If it were not, there would be an additional step of going back to the page that lists all the links and clicking the next one from there.)

That's pretty tedious to do over and over again. It is the process I want to automate.

  • iMacros. This kind of repetitive task is exactly what macros are for. I've used iMacros before for similar tasks in a Web browser, but hadn't used it in a long time. I reinstalled it, figured out how to use it again, and wrote a one- or two-line macro to save the current page with the filename of its URL. Then, when I attempted to run it, iMacros informed me that the SaveAs command is not available in the free version and I need to step up to the $100 version (with a 30-day free trial) to get that capability. I was not impressed with what I'd seen in the current version of the software, and found it to be clunky and poorly documented. So I preferred to look for another solution.

  • Wget. This is very cool. Wikipedia describes it as "a computer program that retrieves content from web servers." It was new to me and took a while to figure out. Mainly billed as a Unix program, it's also available for Windows, and it's just a small executable file that requires no installation. I learned it enough to get it to download a few test pages, but when I went to the pages I need to archive, it didn't work on them. I've sent an e-mail to the Wget mailing list about the problem and am waiting to see if I can figure it out with some help. (The linked e-mail has the Wget command line I used, including the URL of a page I want to archive, with attached image files of what the page looks like online and after being saved by Wget.)

As of more than a week later, there has been no reply on the Wget mailing list.

  • Selenium. Although it describes itself as a tool for building "test cases" rather than macros, it looks like a much higher quality macro system than iMacros. So I tried it out. But I found that it doesn't record everything I need to do in the procedure under "Manual process in Firefox" above. For example, when I right-clicked on the link and typed "A" to store the linked URL, Selenium did not add anything to the script it was recording. After following the link, when I saved the page, Selenium again recorded nothing. So, while it looks like quality software, it doesn't seem to have the capability I need, unless I'm misunderstanding something.
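For reference, a Wget run aimed at "Save as, complete" behavior usually combines a handful of documented options. The sketch below is not the command line from the mailing-list e-mail (which is not reproduced here). The function name build_wget_command, the output directory, and any URL passed in are illustrative assumptions. Pages that assemble their content with JavaScript or sit behind a login may still fail to save correctly this way.

```python
import subprocess
import sys


def build_wget_command(url, output_dir="archive"):
    """Return a Wget argument list that fetches a page plus the files needed to render it.

    The function name and the default output directory are hypothetical.
    """
    return [
        "wget",
        "--page-requisites",   # also download CSS, JavaScript, and images
        "--convert-links",     # rewrite links so they point at the local copies
        "--adjust-extension",  # add .html/.css extensions where they are missing
        "--span-hosts",        # allow requisites that live on other hosts
        f"--directory-prefix={output_dir}",
        url,
    ]


if __name__ == "__main__":
    # Usage: python save_pages.py URL [URL ...]
    for page_url in sys.argv[1:]:
        subprocess.run(build_wget_command(page_url), check=False)
```

Building the argument list separately from running it makes it easy to print and inspect the exact command before pointing it at several hundred pages.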

So I'm stumped. I'm not going to do that manual process several hundred times. So I need to find a way to automate it. How can I do that?

Recommended answer

This answer refers to another answer, which I accepted, but which was later deleted by the moderators. However, that answer was helpful, and fortunately it remains available at Archive.org.

I've accepted Tim Vanderzeil's answer because he directed me to the tool that I needed for this. Now I want to share what I've done with what he gave me. The solution is only semi-automated because of a problem with Kantu, but it's far and away better than trying to do it all manually. I'm posting this here both to share what I've learned and to see if anyone can offer improvements, including a solution to the problem that is preventing full automation.

First, let me mention some background of the technology, which is interesting. Kantu, and especially its extension XModules (which is what I needed for this project), are pretty new. The company that makes them was founded in 2016 and Kantu was announced in September 2017. But their history is way deeper than that since its founders include Mathias Roth, the original developer of iMacros. Kantu is a different implementation of another tool I mentioned in my question, Selenium. So there's a lot of cross-pollination in this esoteric field of browser automation.

Many people have been asking on Stack Overflow for a long time how to automate saving of webpages, such as 1, 2, 3, 4, 5, and 6. None of the answers appear to me to be all that helpful. It's a bit strange because all browsers have the capability, so there have to be some modules floating around somewhere for this, so I don't know why I can't just call a function for it in PHP. The question linked as #5 above says it appears in browsers through "Webkit", but knowing that hasn't led me anywhere useful yet.

So, in the meantime, until I find that PHP function, I have to do it by turning my Web browser into a robot. I developed the code below for a few e-books behind a paywall that I have a legitimate account for and want to preserve for offline use, and that are not offered as PDF files. I determined two ways I could download the pages with Kantu:

  • I massaged the HTML of the table-of-contents pages to extract the needed URLs and put them into CSV files. These can be read by Kantu's command csvRead. The URL is passed to command open to open the page, then command XType sends Ctrl-S (or Alt-F-A) to tell the browser to save the page. XType is used again to enter the filename to save as (the part of the URL after the last "/"), and a final XType sends Enter to conclude the browser's Save-As dialog. Loop this, and the book is saved. The looping can be done either inside the macro using a label and command gotoLabel, or the macro can be written to do one page and the looping can be done in Kantu's GUI.

  • Alternatively, I can use the links on each page to go to the next page. This is the process I described in my question. I first used Kantu's recording process to get the identification of the next-page link, and used that as data in the code for the macro below (specifically as the "target" of commands XClick and click). I start up Kantu on the first webpage and the macro uses command XClick to right-click the next-page link, then XType to send "A" to the browser, telling it to copy the linked URL to the clipboard. Then the command click clicks the link to open the page, and the rest is the same as the previous method. Here, I'm using the next-page links to get the URLs instead of a CSV file.
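The "massaging" step for the CSV method can be sketched in Python. This is a minimal illustration, not the procedure actually used for the e-books: the class and function names, the two-column CSV layout (URL, then filename), and the example.com URLs in the usage note are all assumptions. It collects every link in a table-of-contents page, resolves it against the page's own URL, and derives the filename as the part of the URL path after the last "/".

```python
import csv
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the table-of-contents URL.
                self.urls.append(urljoin(self.base_url, href))


def extract_urls(toc_html, base_url):
    """Return all link targets found in the table-of-contents HTML."""
    collector = LinkCollector(base_url)
    collector.feed(toc_html)
    return collector.urls


def filename_from_url(url):
    """Return the part of the URL path after the last '/' (query string ignored)."""
    return posixpath.basename(urlsplit(url).path)


def write_csv(urls, csv_path):
    """Write one row per page: the URL, then the filename to save it under."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for url in urls:
            writer.writerow([url, filename_from_url(url)])
```

For example, extract_urls('<a href="ch1/page1.html">1</a>', "https://example.com/book/toc.html") resolves the link to https://example.com/book/ch1/page1.html, and filename_from_url reduces that to page1.html. A real table of contents would likely need filtering so that navigation and footer links are left out.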

Now, I mentioned that there is a problem in Kantu that prevents this from being fully automated. The last step of the process, sending Enter to the browser to conclude the Save-As dialog, is flaky for unknown reasons. Sometimes it works, and sometimes the dialog box just sits there, requiring me to press Enter myself to allow the process to move on to the next webpage. This is tedious and means that I need to participate in the process instead of leaving it running on its own. So, not perfect, but a whole lot better than having to do all the rest of the procedure manually as well, which would be out of the question for several hundred pages.

The free version of XModules has a limit of 25 commands per run. To pass that limit there is a one-time charge of $50. That would probably be well worth it if I could let the process run on its own. But since I have to babysit it anyway, I'm currently running the macro by clicking on Kantu's Play macro button for each page as well as watching for when I need to press Enter.

I've posted about the Enter problem and some other issues on Kantu's forum. Their team has been very responsive and helpful. I hope that I or they or someone reading this can figure out a solution. In the meantime, the semi-automated process is better than nothing.

Between the two methods described above, it's only the second one, using the next-page links to get the URLs, that can run without a loop, i.e., with a manual press of Play macro for each page. So that's the one I've been using for now. The code has a rather inelegant repetition of 25 Ctrl-Lefts as a workaround for the surprising absence of the Home key in XType's vocabulary, as well as the absence (as far as I've found) of a command for repeating a key-press.
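Since the macro is plain JSON, the repeated key-press entries do not have to be typed by hand. They can be generated and pasted in. The sketch below is one way to do that under stated assumptions: the helper name repeated_xtype is hypothetical, and grouping five presses per XType command simply mirrors how the macro is laid out rather than any documented limit.

```python
import json


def repeated_xtype(key, count, per_command=5):
    """Build XType command entries that press `key` `count` times in total.

    `key` is a Kantu key token such as "${KEY_CTRL+KEY_LEFT}". Presses are
    grouped `per_command` at a time, with any remainder in the last entry.
    """
    commands = []
    remaining = count
    while remaining > 0:
        presses = min(per_command, remaining)
        commands.append({"Command": "XType", "Target": key * presses, "Value": ""})
        remaining -= presses
    return commands


if __name__ == "__main__":
    # 25 Ctrl-Lefts, five per command, ready to paste into the "Commands" array.
    print(json.dumps(repeated_xtype("${KEY_CTRL+KEY_LEFT}", 25), indent=1))
```

Calling repeated_xtype("${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}", 11, per_command=4) produces three entries holding 4, 4, and 3 presses, which matches the selection step in the macro.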

Here is the Kantu code, in JSON:

{"Name": "SavePageAsComplete",
 "CreationDate": "2019-01-03",
 "Commands":
  [{"Command": "comment",
    "Target":  "Macro for Kantu with XModules. Based on demo macros DemoXClick and DemoXType and docs https://a9t9.com/kantu/docs/xclick and https://a9t9.com/kantu/docs/xtype. The target in the XClick and click commands is what was obtained from attempting to record this macro on the website, which resulted in only an open command and two identical click commands with that target.",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Set play speed to 0.3 seconds. (See Kantu manual section 'Setting the right macro replay speed'.)",
    "Value":   ""
    },
   {"Command": "store",
    "Target":  "medium",
    "Value":   "!replayspeed"
    },
   {"Command": "bringBrowserToForeground",
    "Target":  "",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Right-click the link for the next page and copy its URL to the clipboard.",
    "Value":   ""
    },
   {"Command": "XClick",
    "Target":  "//*[@id=\"container\"]/div[2]/section/div[2]/a/div",
    "Value":   "#right"
    },
   {"Command": "XType",
    "Target":  "A",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Click the link for the next page. (Tried with 'clickAndWait' instead in order to wait for the page to load, but that yielded error 'No page load event detected after 10 seconds.')",
    "Value":   ""
    },
   {"Command": "click",
    "Target":  "//*[@id=\"container\"]/div[2]/section/div[2]/a/div",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Open the Save-as dialog.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_S}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Wait for the dialog to appear.",
    "Value":   ""
    },
   {"Command": "pause",
    "Target":  "2000",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Paste the clipboard (URL of now-current page) into Filename text box.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_V}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Move the cursor to the beginning of the URL. (There is no Home key!)",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Select from the beginning of the URL to the end of its path part.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Delete the selection, leaving just the filename.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_DEL}",
    "Value":   ""
    },
   {"Command": "pause",
    "Target":  "500",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Save the page.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_ENTER}",
    "Value":   ""
    }
   ]
 }

Maybe this will be of some help to other people who've been wanting to automate saving of pages. And if anyone can improve on this, maybe you could say how in a comment or another answer. Especially if you know why the Save-As dialog box doesn't close reliably, and know how to fix that.
