Unknown url type error in urllib2


Question

I have searched through many similar questions on SO, but did not find an exact match for my case.

I am trying to download a video using Python 2.7.

Here is my code for downloading the video:

import urllib2
from bs4 import BeautifulSoup as bs


with open('video.txt','r') as f:
    last_downloaded_video = f.read()

webpage = urllib2.urlopen('http://*.net/watch/**-'+last_downloaded_video)

soup = bs(webpage)
a = []
for link in soup.find_all('a'):
    if link.has_attr('data-video-id'):
        a.append(link)

#try just with first data-video-id

id = a[0]['data-video-id']
webpage2 = urllib2.urlopen('http://*/video/play/'+id)
soup = bs(webpage2)
string = str(soup.find_all('script')[2])
print string
url = string.split(': ')[1].split(',')[0]
url = url.replace('"','')
print url
print type(url)

video = urllib2.urlopen(url).read()
filename = "video.mp4"
with open(filename,'wb') as f:
    f.write(video)

This code gives an unknown url type error. The traceback is:

Traceback (most recent call last):
  File "naruto.py", line 26, in <module>
    video = urllib2.urlopen(url).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 427, in _open
    'unknown_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1247, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: 'http>

However, when I store the same URL in a variable and attempt to download it from the terminal, no error is shown. I am confused as to what the problem is. I found a similar question on the Python mailing list.

Recommended answer

It's hard to tell without seeing the HTML of the page that you are scraping. However, a stray ' (single quote) character at the beginning of the URL might be the cause, since it produces the same exception:

>>> import urllib2
>>> urllib2.urlopen("'http://blah.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "urllib2.py", line 404, in open
    response = self._open(req, data)
  File "urllib2.py", line 427, in _open
    'unknown_open', req)
  File "urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "urllib2.py", line 1249, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: 'http>

So, try cleaning up your URL and removing any stray quotes.
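
One way to confirm the diagnosis before changing anything is to print the `repr()` of the string, which makes stray quotes and whitespace visible. The snippet below is only a sketch: the `url` value is a hypothetical stand-in for whatever the scraper extracted, and taking the text before the first colon is a rough approximation of how the URL type is read, not the library's actual code.

```python
# Hypothetical value standing in for what the scraper extracted.
url = "'http://blah.com'"

# repr() shows the string with its delimiters, so embedded quotes stand out.
print(repr(url))

# Rough approximation of how the URL type is read: the text before the first colon.
scheme = url.partition(':')[0]
print(scheme)
```

If the second line printed is anything other than a bare `http` or `https`, the string still contains characters that need to be removed.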

Update after OP feedback:

The results of the print statement indicate that the URL has a single quote character at the beginning and end of the URL string. There should not be any quotes of any type surrounding the URL when it is passed to urlopen(). You can remove leading and trailing quotes (both single and double) from the URL string with this:

url = url.strip('\'"')
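
If the scraped value may also carry surrounding whitespace, the same idea can be wrapped in a small helper that checks the scheme before the URL reaches urlopen(). This is a minimal sketch, not part of the original answer: `clean_url` is a hypothetical name, and restricting the scheme to `http` and `https` is an assumption about what the scraper should accept.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7, as used in the question


def clean_url(raw):
    """Strip surrounding whitespace and quotes, then check the scheme."""
    # Remove spaces, tabs, newlines, and both quote types from each end.
    url = raw.strip(' \t\r\n\'"')
    scheme = urlparse(url).scheme
    # Assumption: only http and https URLs are expected here.
    if scheme not in ('http', 'https'):
        raise ValueError('unexpected URL scheme %r in %r' % (scheme, raw))
    return url


print(clean_url("'http://blah.com/video.mp4'"))
```

Raising early gives an error message that shows the raw string, which is usually easier to debug than the URLError raised deep inside urlopen().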
