UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data


Problem description


I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me it didn't recognize the encoding, even though the string contains no special characters. I don't know if that's useful info.

My code looks like this:

from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener): # spoofs a real browser on Windows
   version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")

print "Folder Name?"
foldername = raw_input("8::>")

if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
   while page[start]!= '"':
      start += 1
   close = start
   while page[close]!='"':
      close += 1
   return page[start:close]

myopener = MyOpener()

response = myopener.open(webaddress)
site = response.read()

nexturl = ''
counter = 0

while(nexturl!=webaddress):
   counter += 1
   start = 0

   for i in range(len(site)-35):
       if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
         start = i + 40
         break
   else:
      print "Something's broken, chief. Error = 1"

   next = 0

   for i in range(start, 8, -1):
      if site[i:i+8] == u'<a href=':
         next = i
         break
   else:
      print "Something's broken, chief. Error = 2"

   nexturl = urlpuller(next, site)

   myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

print("Retrieval of "+foldername+" completed.")

When I run it against the site I'm scraping, it returns this error:

Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When pointed at http://google.com, it worked just fine.

The page I'm scraping declares its encoding as UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

but when I try to decode using utf-8, as you can see, it does not work.

Any suggestions?

Solution

site[i:i+35].decode('utf-8')

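You cannot arbitrarily slice the bytes you've received and then ask the UTF-8 codec to decode each slice. UTF-8 is a multibyte encoding, meaning one character can take anywhere from 1 to 4 bytes. If a slice cuts a character in half and you ask Python to decode it, you get the "unexpected end of data" error. That is what happened here: 0xc3 is the first byte of a two-byte sequence, and it sits at position 34, the last byte of your 35-byte slice, so its second byte is missing.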
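
To make the failure concrete, here is a minimal sketch that reproduces the error on a truncated slice and then shows two ways to avoid it. The `site` bytes and the `MARKER` name are made up for illustration; only the tag text comes from the question.

```python
# Hypothetical page bytes: some UTF-8 text containing a non-ASCII character,
# followed by the tag the scraper is looking for.
MARKER = b'<img id="imgSized" class="slideImg"'
site = u'<p>caf\u00e9</p>'.encode('utf-8') + MARKER + b' src="a.jpg">'

# 1. Reproduce the failure: this slice ends on 0xc3, the first byte of the
#    two-byte sequence for U+00E9, so the decoder runs out of data.
try:
    site[:7].decode('utf-8')
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason

# 2. Option A: search the raw bytes directly; no decoding is needed.
byte_index = site.find(MARKER)

# 3. Option B: decode the complete response once, then search the text.
text = site.decode('utf-8')
char_index = text.find(MARKER.decode('ascii'))
```

Either option replaces the loop that decodes every 35-byte window. Note that the two indexes differ whenever non-ASCII characters come before the match, because one counts bytes and the other counts characters.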

Look into a tool that does this parsing for you. BeautifulSoup and lxml are two options.
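
As a rough illustration of the parser-based approach, the sketch below uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup or lxml, and assumes Python 3 (on Python 2.7 the module is named `HTMLParser`). The `SlideFinder` class and the sample page are hypothetical; only the `imgSized` id comes from the question.

```python
from html.parser import HTMLParser


class SlideFinder(HTMLParser):
    """Records the src of <img id="imgSized"> and the href of every <a>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.image_src = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('id') == 'imgSized':
            self.image_src = attrs.get('src')
        elif tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])


# Hypothetical response body, as bytes, containing a non-ASCII character.
site = (u'<html><body><p>caf\u00e9</p>'
        u'<a href="/page2">next</a>'
        u'<img id="imgSized" class="slideImg" src="/img/1.jpg">'
        u'</body></html>').encode('utf-8')

finder = SlideFinder()
finder.feed(site.decode('utf-8'))  # decode the whole response once, then parse
```

After `feed()` returns, `finder.image_src` holds the image URL and `finder.links` the candidate next-page links, with no manual byte offsets involved. BeautifulSoup and lxml offer the same kind of lookup with less code and better tolerance for malformed HTML.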
