Download files with Python (REST URL)


Problem description




I am trying to write a script that will download a bunch of files from a website that has REST URLs.

Here is the GET request:

GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close

If the request is good, it will return a 302 response such as this one:

HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

What I need the script to do is check whether the response is a 302. If it is not, the script should simply "pass"; if it is, the script needs to parse out the location header shown here:

location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6

Once I have the location value, I will have to make another GET request to download that file. I will also have to keep sending my session cookie in order to download the file.

Can someone point me in the right direction as to which library is best for this? I am having trouble finding out how to parse the 302 response and how to add a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.

Any help would be much appreciated.

Solution

import urllib.request as ur
import urllib.error as ue

# Note that http.client.HTTPResponse.read([amt]) reads and returns the
# response body as bytes (or up to the next amt bytes); the headers are not
# part of it. urlopen() does not decode the body, so the output file is
# opened in binary mode. urlopen() also follows 302 redirects by default.

url = "http://www.example.org/images/{}.jpg"

dst = ""  # destination prefix; an empty string means the current directory
arr = ["01", "02", "03", "04", "05", "06", "07", "08", "09"]
# arr = range(10, 20)
try:
    for x in arr:
        print(str(x) + "). ".ljust(4), end="")
        # urlopen() returns an http.client.HTTPResponse; the with block
        # closes both the response and the file, even if writing fails
        with ur.urlopen(url.format(x)) as hrio, open(dst + str(x) + ".jpg", "wb") as fh:
            fh.write(hrio.read())
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    # the first failed request ends the loop
    print("\t[REQUEST INCOMPLETE]\t", end="")
    print("<Error ~ [{}]>".format(e))
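
The snippet above downloads a numbered series of images, but it does not show the two things the question asks about: checking for a 302 and sending the session cookie. The following is a minimal sketch of one way to do both with the standard library only. The names `NoRedirect`, `find_redirect`, and `download` are hypothetical helpers made up for this illustration, and the cookie string is assumed to be copied by hand from a logged-in browser session, as in the GET request in the question.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical helper: stop urllib from following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None declines the redirect, so urllib raises HTTPError
        # for the 3xx response and the caller can inspect it.
        return None


def find_redirect(url, cookie_header):
    """Return the Location header if the server answers 302, else None."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, headers={"Cookie": cookie_header})
    try:
        with opener.open(request):
            return None  # not a redirect, so "pass"
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location") if err.code == 302 else None
        err.close()
        return location


def download(url, cookie_header, dst_path):
    """GET url with the session cookie and write the body to dst_path."""
    request = urllib.request.Request(url, headers={"Cookie": cookie_header})
    with urllib.request.urlopen(request) as response, open(dst_path, "wb") as fh:
        fh.write(response.read())
```

A call might look like `location = find_redirect("http://testServer.com/test/download/id/5774/format/testTitle", "PHPSESSID=...")`, followed by `download(location, "PHPSESSID=...", "picture.jpg")` when `location` is not `None`. If inspecting the 302 is not actually required, the `NoRedirect` step can probably be dropped, since `urlopen()` follows redirects by default.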
