How can I grab PDF links from a website with a Python script
Question
Quite often I have to download PDFs from websites, but sometimes they are not all on one page. The links are divided across paginated pages, and I have to click through every page to get the links.
I am learning Python, and I want to write a script where I can put in a web URL and it extracts the PDF links from that website.
I am new to Python, so can anyone please give me directions on how I can do it?
Answer
Pretty simple with urllib2, urlparse, and lxml. I've commented things more verbosely since you're new to Python:
# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])
Output:
http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
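
Note that urllib2 and urlparse are Python 2 modules; in Python 3 they were folded into urllib.request and urllib.parse. Below is a minimal Python 3 sketch of the same approach (a translation of the answer's code, not tested against the example site; lxml still has to be installed separately):

# Python 3 version of the same idea (untested sketch):
# urllib2 -> urllib.request, urlparse -> urllib.parse
import lxml.html, urllib.request, urllib.parse

base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page and parse the response into an html tree
with urllib.request.urlopen(base_url) as res:
    tree = lxml.html.fromstring(res.read())

# same regex-enabled xpath as above, via the EXSLT namespace
ns = {'re': 'http://exslt.org/regular-expressions'}
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # resolve relative hrefs against the page url and print them
    print(urllib.parse.urljoin(base_url, node.attrib['href']))

To handle the pagination mentioned in the question, you could run the same extraction over each page URL in turn, collecting the page URLs from the pager's <a> tags first; the exact XPath for those links depends on the site's markup.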