How to Download PDFs from Scraped Links [Python]?

Problem Description

I'm working on making a PDF Web Scraper in Python. Essentially, I'm trying to scrape all of the lecture notes from one of my courses, which are in the form of PDFs. I want to enter a url, and then get the PDFs and save them in a directory in my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about doing this. None of the questions on StackOverflow seem to be helping me either.

Here is what I have so far:

import requests
from bs4 import BeautifulSoup
import shutil

bs = BeautifulSoup

url = input("Enter the URL you want to scrape from: ")
print("")

suffix = ".pdf"

link_list = []

def getPDFs():    
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)

    # for link in soup.find_all('a'):  # Finds all links
    #     if suffix in str(link):  # If the link ends in .pdf
    #         link_list.append(link.get('href'))
    # print(link_list)

    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    print("PDF Saved")

getPDFs()

Originally, I had gotten all of the links to the PDFs, but did not know how to download them; the code for that is now commented out.

Now I've gotten to the point where I'm trying to download just one PDF; and a PDF does get downloaded, but it's a 0KB file.

If it's of any use, I'm using Python 3.4.2

Recommended Answer

If this is something that does not require being logged in, you can use urlretrieve():

from urllib.request import urlretrieve

for link in link_list:
    # Pass a filename, otherwise urlretrieve() saves to a temporary file;
    # here we reuse the last segment of the URL as the local name
    urlretrieve(link, link.split('/')[-1])
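If you want to stay with requests and BeautifulSoup, the two halves of the question (collecting the links and then saving each file) can be combined. The sketch below is one way to do it, assuming the page's `<a>` tags point at the PDFs (possibly via relative links); note that each PDF needs its own request, and that `response.content` gives the raw bytes for writing:

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def pdf_name(link):
    """Derive a local filename from the last path segment of a link."""
    return os.path.basename(urlparse(link).path)

def download_pdfs(page_url, dest_dir="."):
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.lower().endswith(".pdf"):
            pdf_url = urljoin(page_url, href)   # resolve relative links
            pdf = requests.get(pdf_url)         # a fresh request per file
            with open(os.path.join(dest_dir, pdf_name(pdf_url)), "wb") as f:
                f.write(pdf.content)            # .content is the raw bytes
```

This also explains the 0KB file in the original code: `response.text` had already consumed the response body, so `response.raw` was empty by the time `shutil.copyfileobj()` ran.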
