Python crawler does not work properly

Problem description

I have just written a Python crawler to download MIDI files from freemidi.org. Looking at the request headers in Chrome, I found that when the download URL is https://freemidi.org/getter-20225 (referred to below as "getter-20225"), the "Referer" header has to be https://freemidi.org/download-20225 (referred to below as "download-20225") for the MIDI file to download properly. I did the same in Python, setting the headers like this:

headers = {
    'Referer': 'https://freemidi.org/download-20225',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

These were exactly the same as the request headers I had seen in Chrome. I then tried to download the file with this line of code:

midi = requests.get(url, headers=headers).content

However, it did not work properly. Instead of the MIDI file, it downloaded the HTML of the "download-20225" page. I later found that if I try to access "getter-20225" directly, it also takes me to "download-20225". I suspect this means my headers were wrong, so the site sent me to the other page instead of starting the download.

I'm quite new to writing Python crawlers, so could someone help me find out what went wrong with the program?

Recommended answer

It looks like the problem here is that the page with the MIDI file (e.g. "getter-20225") wants to redirect you back to the song page (e.g. "download-20225") after the song has downloaded. However, requests follows redirects by default and returns only the content of the final page in the redirect chain.
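One way to confirm this kind of redirect is to look at the `history` attribute of the response, which requests fills with the intermediate responses it followed. The following is a minimal sketch, not the answerer's code: `describe_redirects` is a hypothetical helper, and the stand-in objects (including the 302 status) are assumptions used only so the example runs without network access.

```python
from types import SimpleNamespace


def describe_redirects(response):
    """Return one 'status url' string per hop, ending with the final response."""
    hops = list(response.history) + [response]
    return [f"{hop.status_code} {hop.url}" for hop in hops]


# Stand-in objects that mimic the attributes of a requests.Response.
# The 302 status code is an assumption; the real site may use another 3xx code.
first_hop = SimpleNamespace(
    status_code=302, url="https://freemidi.org/getter-20225", history=[]
)
final = SimpleNamespace(
    status_code=200, url="https://freemidi.org/download-20225", history=[first_hop]
)

for line in describe_redirects(final):
    print(line)
```

With the real library, the same helper could be called on the object returned by `requests.get(url, headers=headers)`; a non-empty `history` would indicate that the content came from a page other than the one requested.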

You can set the allow_redirects parameter to False to have requests return the content from the "getter" page (i.e. the MIDI file):

midi = requests.get(url, headers=headers, allow_redirects=False)

Note that if you want to write the MIDI file to disk, you will need to open the target file in binary mode, since midi.content is a bytes object rather than text.

with open('example.mid', 'wb') as ex:
    ex.write(midi.content)
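Since the original symptom was an HTML page being saved in place of the song, it may be worth checking the downloaded bytes before writing them. The sketch below assumes a standard .mid file, which begins with the four-byte chunk ID `MThd`; `looks_like_midi` and `save_if_midi` are hypothetical helper names, not part of requests.

```python
import os
import tempfile

MIDI_MAGIC = b"MThd"  # header chunk ID at the start of a Standard MIDI File


def looks_like_midi(data: bytes) -> bool:
    """Return True if the data starts with the Standard MIDI File header."""
    return data[:4] == MIDI_MAGIC


def save_if_midi(data: bytes, path: str) -> bool:
    """Write data to path in binary mode only if it looks like MIDI."""
    if not looks_like_midi(data):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True


# Demo with made-up byte strings instead of a real download.
target = os.path.join(tempfile.mkdtemp(), "example.mid")
print(save_if_midi(b"<!DOCTYPE html><html>...</html>", target))  # HTML is rejected
print(save_if_midi(b"MThd\x00\x00\x00\x06\x00\x01", target))     # MIDI header is accepted
```

In the answer's code, the call would be `save_if_midi(midi.content, 'example.mid')`; a `False` result would suggest the server returned something other than the MIDI file.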
