Python, scrapy: bad UTF-8 characters written to file from scraped HTML page with charset iso-8859-1
I want to scrape a webpage with charset iso-8859-1
with Scrapy, in Python 2.7. The text I'm interested in on the webpage is: tempête
Scrapy returns the response as UTF-8 Unicode with the characters correctly encoded:
>>> response
u'temp\xc3\xaate'
Now, I want to write the word tempête
to a file, so I do the following:
>>> import codecs
>>> file = codecs.open('test', 'a', encoding='utf-8')
>>> file.write(response)  # response is the variable above
When I open the file, the resulting text is tempête.
It seems that Python does not detect the proper encoding: it can't read the two-byte encoded character and treats it as two one-byte characters.
How can I handle this simple use case?
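What ends up on disk can be checked directly: writing the mis-decoded string through a UTF-8 codec double-encodes the ê, so the file holds four bytes where two were expected (a minimal sketch using a temporary file; runnable under Python 2.7 or 3):

```python
import codecs
import os
import tempfile

# Mis-decoded string: UTF-8 bytes that were read back as iso-8859-1
bad = u'temp\xc3\xaate'

path = os.path.join(tempfile.mkdtemp(), 'test')
f = codecs.open(path, 'a', encoding='utf-8')
f.write(bad)
f.close()

# On disk, the single character ê is now four bytes instead of two
with open(path, 'rb') as fh:
    data = fh.read()
print(repr(data))  # data == b'temp\xc3\x83\xc2\xaate'
```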
If in your example response
is a (decoded) Unicode string with \xc3\xaa
inside, then something is wrong at the Scrapy encoding-detection level.
\xc3\xaa
is the character ê
encoded as UTF-8, so you should only see those characters in (encoded) non-Unicode/str
strings (in Python 2, that is).
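That mismatch can be reproduced directly: decoding the UTF-8 bytes of tempête with the wrong codec yields exactly the string the question shows (a minimal sketch using bytes literals, so it runs under Python 2.7 or 3):

```python
# UTF-8 bytes for the word tempête
raw = b'temp\xc3\xaate'

# Decoding with the wrong codec reproduces the question's string
wrong = raw.decode('iso-8859-1')   # wrong == u'temp\xc3\xaate'

# Decoding with the right codec gives the real word
right = raw.decode('utf-8')        # right == u'temp\xeate'
```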
Python 2.7 shell session:
>>> # what your input should look like
>>> tempete = u'tempête'
>>> tempete
u'temp\xeate'
>>> # UTF-8 encoded
>>> tempete.encode('utf-8')
'temp\xc3\xaate'
>>>
>>> # latin1 encoded
>>> tempete.encode('iso-8859-1')
'temp\xeate'
>>>
>>> # back to your sample
>>> s = u'temp\xc3\xaate'
>>> print s
tempête
>>>
>>> # if you use a non-Unicode string with those characters...
>>> s_raw = 'temp\xc3\xaate'
>>> s_raw.decode('utf-8')
u'temp\xeate'
>>>
>>> # ... decoding from UTF-8 works
>>> print s_raw.decode('utf-8')
tempête
>>>
Something is wrong with Scrapy interpreting the page as iso-8859-1
encoded.
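If such a mis-decoded string has already been produced, the original bytes can be recovered with a latin-1 round trip, since iso-8859-1 maps bytes 0x00–0xFF one-to-one onto the first 256 code points (a generic repair, not specific to Scrapy):

```python
# Mis-decoded string: UTF-8 bytes that were decoded as iso-8859-1
s = u'temp\xc3\xaate'

# Re-encode with latin-1 to recover the raw UTF-8 bytes,
# then decode them with the correct codec
fixed = s.encode('iso-8859-1').decode('utf-8')

print(repr(fixed))  # fixed == u'temp\xeate', i.e. tempête
```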
You can force the encoding by re-building a response from response.body:
>>> import scrapy.http
>>> hr1 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='latin1')
>>> hr1.body_as_unicode()
u'<html><body>temp\xc3\xaate</body></html>'
>>> hr2 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='utf-8')
>>> hr2.body_as_unicode()
u'<html><body>temp\xeate</body></html>'
>>>
Build a new response:
newresponse = response.replace(encoding='utf-8')
and work with newresponse
instead.