How to download a PDF file in Ruby when the link does not end in .pdf


Problem description

I need to use Ruby to download a PDF from a website that does not provide a link ending in .pdf. When I click the download link manually, it takes me to a new page, and the dialog box to save or open the file appears after some time.

Please help me download the file.

Link

Recommended answer

If you just need a simple Ruby script to do it, I'd just run wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'

At that point, though, you might as well run wget directly.
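If the download does have to be started from Ruby, a rough sketch along these lines may be more convenient than exec, because exec replaces the running Ruby process while system returns control to the script. The helper names wget_args and download_with_wget are made up for illustration, and the sketch assumes wget is installed and on the PATH. The -O option tells wget which file name to write, so no rename is needed afterwards.

```ruby
# Build the argument list for wget. Passing the arguments as separate
# strings (rather than one shell string) avoids shell quoting problems
# with URLs that contain commas, ampersands, or question marks.
# wget_args is a hypothetical helper name, not part of any library.
def wget_args(url, output_path)
  ['wget', '-O', output_path, url]
end

# Run wget and report whether it exited successfully.
# system returns true on a zero exit status, false on a non-zero one,
# and nil if the command could not be started at all.
def download_with_wget(url, output_path)
  system(*wget_args(url, output_path)) == true
end
```

Calling download_with_wget('http://path.to.the.file/and/some/params', 'thefile.pdf') would then try to save the response as thefile.pdf.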

The other way is to run a GET request against the page where you know the PDF is:

require 'net/http'
source = Net::HTTP.get('the.website.com', '/and/some/params')

There are a number of other HTTP clients you could use, but as long as you make a GET request to the endpoint that serves the PDF, it should give you the raw data. Then you can just rename the file, and you'll have the PDF.
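The GET-and-save approach described above could be sketched roughly as follows, using only Ruby's standard library. The method names fetch_body, pdf? and save_pdf are hypothetical, and the redirect handling is an assumption about how such a server might behave, not something confirmed for this site. The check for a leading "%PDF-" relies on the fact that PDF files normally start with that header, so an HTML error page is not silently saved under a .pdf name.

```ruby
require 'net/http'
require 'uri'

# Fetch the raw response body for a URL. Net::HTTP.get_response does not
# follow redirects by itself, so a small number of them is followed here.
def fetch_body(url, redirect_limit = 5)
  raise ArgumentError, 'too many redirects' if redirect_limit.zero?

  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the URL.
    fetch_body(URI.join(url, response['location']).to_s, redirect_limit - 1)
  else
    raise "request failed: #{response.code} #{response.message}"
  end
end

# PDF files normally begin with the bytes "%PDF-".
def pdf?(data)
  data.to_s.b.start_with?('%PDF-')
end

# Write the data in binary mode so no newline or encoding conversion
# is applied to the bytes.
def save_pdf(data, path)
  raise ArgumentError, 'response does not look like a PDF' unless pdf?(data)

  File.binwrite(path, data)
end
```

A call such as save_pdf(fetch_body('http://the.website.com/and/some/params'), 'thefile.pdf') would then download the data and write it to thefile.pdf.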

In your case, I ran the following commands to get the PDF:

wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf

Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I mentioned previously.

Update:

There is an added complication that was not initially stated: the URL of the PDF changes every time the PDF is updated. To make this work, you will probably need to do some web scraping. I suggest Nokogiri. That way you can parse the page that contains the download link and then perform a GET request on the URL you find there. Furthermore, the server that hosts the PDF is misconfigured and breaks Chrome within a few seconds of opening the page.

How to solve this problem: I went to the site and refreshed it, then broke the connection to the server (press the X that appears where the refresh button would otherwise be). Then I right-clicked next to the download link and selected "Inspect element", and browsed the DOM to find something that identifies the link unambiguously (like an id). Thankfully, I found <strong id="telecharger"> Download</strong>. This means that you can use something like page.css('strong#telecharger')[0].parent['href'], which should give you a URL. Then you can perform a GET request as described above. I don't have time to write the script for you (too much work to do), but this should be enough to solve the problem.
