How can I grab PDF links from a website with a Python script
Question
Quite often I have to download PDFs from websites, but sometimes they are not all on one page. The links are divided across paginated pages, and I have to click through every page to get the links.
I am learning Python, and I want to write a script where I can put in a web URL and it extracts the PDF links from that website.
I am new to Python, so can anyone please give me directions on how I can do it?
Answer
Pretty simple with urllib2, urlparse, and lxml. I've commented things more verbosely since you're new to Python:
# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])
Output:
http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
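
Note that urllib2 and urlparse are Python 2 modules; in Python 3 they were folded into urllib.request and urllib.parse. Below is a minimal Python 3 sketch of the same approach (a translation of the answer's code, not tested against the example site; lxml still has to be installed separately):

# Python 3 version of the same idea (untested sketch):
# urllib2 -> urllib.request, urlparse -> urllib.parse
import lxml.html, urllib.request, urllib.parse

base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page and parse the response into an html tree
with urllib.request.urlopen(base_url) as res:
    tree = lxml.html.fromstring(res.read())

# same regex-enabled xpath as above, via the EXSLT namespace
ns = {'re': 'http://exslt.org/regular-expressions'}
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # resolve relative hrefs against the page url and print them
    print(urllib.parse.urljoin(base_url, node.attrib['href']))

To handle the pagination mentioned in the question, you could run the same extraction over each page URL in turn, collecting the page URLs from the pager's <a> tags first; the exact XPath for those links depends on the site's markup.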