Python and Scrapy: the encoding issue

Views: 149 | Published: 2017/8/16 23:41:45 | Tags: python unicode encoding utf-8 scrapy


Problem description

I simply can't figure this out! :( I am scraping data from a UTF-8 encoded site, or at least that is what it claims:

Content-Type: text/html;charset=utf-8

I get a list of regular unicode strings from the XPath selector's extract() call:

item['city']= element.select('//div[@id="bubble_2"]/div/text()').extract()

The resulting list:

[u'Westbahnhofstr.\xa010', u'72070\xa0T\xfcbingen']

Now I join the list into one unicode string:

item['city']= "".join(element.select('//div[@id="bubble_2"]/div/text()').extract())

So far so good:

u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'

The problem appears when I try to output this unicode string, either to the screen (print) or to a file (write). Whatever I try, it returns an error (http://pastebin.com/51DkX2R2):

exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 11: ordinal not in range(128)

Of course, I encoded the unicode string to a byte string before output:

item['city'].encode('utf-8')

This is my pipeline.py, showing how I open and write to my CSV file:

import csv
import items
import urlparse
import codecs

class DepostPipeline(object):
    def __init__(self):
        self.modelsCsv = csv.writer(codecs.open('Dees.csv', mode='w',encoding='utf-8'))
        self.modelsCsv.writerow(['city'])

    def process_item(self, item, spider):
        if isinstance(item, items.DetailsItem):
            item['city'] = item['city'].encode('utf-8')

            self.modelsCsv.writerow([item['city']])
            return item

The weirdest thing is that my system (Python on Windows) handles unicode strings perfectly:

C:\Console2>python
Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s=u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'
>>> print s
Beim Nonnenhaus 672070 Tübingen

I have been reading a lot about UTF-8, unicode, encoding and decoding over the last 10 days, but it seems I am still missing something here. I appreciate any help or advice.

Recommended answer

You are ignoring the result of the .encode() call:

item['city'].encode('utf-8')

Strings are immutable and are not encoded in place. What's more, the returned object has a different type (a byte string rather than unicode). You need to assign the return value back:

item['city'] = item['city'].encode('utf-8')
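
A minimal sketch of that point, reusing the string from the question. The variable names are illustrative only. The second type prints as str on Python 2 and as bytes on Python 3:

```python
city = u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'

# Calling encode() and discarding the result changes nothing.
city.encode('utf-8')
still_text = city  # still the original unicode string

# Keeping the return value yields a separate UTF-8 byte string.
encoded = city.encode('utf-8')

print(type(still_text).__name__)
print(type(encoded).__name__)
print(encoded.decode('utf-8') == city)
```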

However, you should not use codecs.open() for the CSV file. In Python 2, the csv module always writes byte strings, not Unicode.

Because the file object comes from codecs.open(), an implicit decode takes place to get back to Unicode, and it is that decode which fails. That is why you get a UnicodeDecodeError exception rather than an encode error:

  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 11: ordinal not in range(128)
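
The failing step can be reproduced explicitly. The sketch below is an illustration rather than the code path inside codecs.py: it decodes the UTF-8 bytes with the ASCII codec by hand, which is the conversion Python 2 attempts implicitly when a byte string reaches a codecs.open() file. The names are hypothetical.

```python
city = u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'
encoded = city.encode('utf-8')  # what process_item stores in item['city']

try:
    # Done explicitly here; Python 2 does this behind the scenes with its
    # default ASCII codec when it needs unicode but is handed bytes.
    encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The first non-ASCII byte is 0xc2, the lead byte of U+00A0 in UTF-8.
print(error)
```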

Open the file with the built-in open() instead:

self.modelsCsv = csv.writer(open('Dees.csv', mode='wb'))

Note the 'wb'; the csv module handles line endings itself.
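
The advice above is for Python 2.7, which the question uses. For comparison only, here is a hedged sketch of how the same write might look on Python 3, where the csv module works with text instead of bytes. The temporary path and variable names are assumptions made for illustration.

```python
import csv
import os
import tempfile

city = u'Beim Nonnenhaus\xa0672070\xa0T\xfcbingen'

# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'Dees.csv')

# Python 3: open in text mode, let the file object do the encoding, and
# pass newline='' so the csv module controls line endings itself.
with open(path, mode='w', encoding='utf-8', newline='') as handle:
    writer = csv.writer(handle)
    writer.writerow(['city'])
    writer.writerow([city])  # no manual .encode() call needed

# Read the raw bytes back to confirm they are UTF-8.
with open(path, mode='rb') as handle:
    raw = handle.read()

print(raw.decode('utf-8'))
```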
