How can I grab PDF links from a website with a Python script?

Question

Quite often I have to download PDFs from websites, but sometimes they are not all on one page. The links are split across paginated pages, and I have to click through every page to collect them.

I am learning Python, and I want to write a script where I can supply a web URL and it extracts the PDF links from that website.

I am new to Python, so can anyone please give me directions on how I can do this?

Answer

Pretty simple with urllib2, urlparse, and lxml. I've commented things more verbosely since you're new to Python:

# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):

    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])

Result:

http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
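
For reference, the answer above targets Python 2 (urllib2, urlparse, and the print statement). Below is a rough Python 3 port of the same idea, which also sketches one way to handle the pagination mentioned in the question: follow a rel="next" link from page to page. That selector is an assumption about the site's markup, so adjust it to whatever the actual "next page" control looks like.

# Python 3 port: urllib2 and urlparse were merged into urllib
import lxml.html, urllib.request
from urllib.parse import urljoin

# regular-expressions namespace for the xpath() call, as in the original
ns = {'re': 'http://exslt.org/regular-expressions'}

url = 'http://www.renderx.com/demos/examples.html'
while url:
    # fetch the current page and parse it into an element tree
    tree = lxml.html.fromstring(urllib.request.urlopen(url).read())

    # print every link whose href ends in ".pdf", resolved to an absolute URL
    for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
        print(urljoin(url, node.attrib['href']))

    # follow a "next page" link if there is one; the rel="next" selector is
    # a guess, so change it to match your site's pagination markup
    nxt = tree.xpath('//a[@rel="next"]/@href')
    url = urljoin(url, nxt[0]) if nxt else None

The re:test() function comes from the EXSLT extensions, which lxml supports out of the box; that is why the ns dictionary maps the re prefix to the EXSLT regular-expressions namespace.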
