How can I grab PDF links from a website with a Python script?

Question

Quite often I have to download PDFs from websites, but sometimes they are not all on one page. The links are split across paginated pages, and I have to click through every page to collect them.

I am learning Python, and I want to write a script where I can supply a web URL and it extracts the PDF links from that website.

I am new to Python, so can anyone please give me directions on how I can do this?

Answer

Pretty simple with urllib2, urlparse, and lxml. I've commented things more verbosely since you're new to Python:

# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):

    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])

Result:

http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
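
For reference, the answer above targets Python 2 (urllib2, urlparse, and the print statement). Below is a rough Python 3 port of the same idea, which also sketches one way to handle the pagination mentioned in the question: follow a rel="next" link from page to page. That selector is an assumption about the site's markup, so adjust it to whatever the actual "next page" control looks like.

# Python 3 port: urllib2 and urlparse were merged into urllib
import lxml.html, urllib.request
from urllib.parse import urljoin

# regular-expressions namespace for the xpath() call, as in the original
ns = {'re': 'http://exslt.org/regular-expressions'}

url = 'http://www.renderx.com/demos/examples.html'
while url:
    # fetch the current page and parse it into an element tree
    tree = lxml.html.fromstring(urllib.request.urlopen(url).read())

    # print every link whose href ends in ".pdf", resolved to an absolute URL
    for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
        print(urljoin(url, node.attrib['href']))

    # follow a "next page" link if there is one; the rel="next" selector is
    # a guess, so change it to match your site's pagination markup
    nxt = tree.xpath('//a[@rel="next"]/@href')
    url = urljoin(url, nxt[0]) if nxt else None

The re:test() function comes from the EXSLT extensions, which lxml supports out of the box; that is why the ns dictionary maps the re prefix to the EXSLT regular-expressions namespace.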
