How can I get an output in UTF-8 encoded unicode from Scrapy?

Bear with me. I'm writing every detail because so many parts of the toolchain do not handle Unicode gracefully and it's not clear what is failing.

PRELUDE

We first set up and use a recent Scrapy.

source ~/.scrapy_1.1.2/bin/activate

Since the terminal's default is ascii, not unicode, we set:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Also since by default Python uses ascii, we modify the encoding:

export PYTHONIOENCODING="utf_8"

Now we're ready to start a Scrapy project.

scrapy startproject myproject
cd myproject
scrapy genspider dorf PLACEHOLDER

We're told we now have a spider.

Created spider 'dorf' using template 'basic' in module:
  myproject.spiders.dorf

We modify myproject/items.py to be:

# -*- coding: utf-8 -*-
import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()

ATTEMPT 1

Now we write the spider, relying on urllib.unquote

# -*- coding: utf-8 -*-
import scrapy
import urllib
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'http://en.sistercity.info/']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = urllib.unquote(
            response.xpath('//title').extract_first().encode('ascii')
        ).decode('utf8')
        return item
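The unquote-then-decode dance in `parse` can be seen in isolation. A minimal sketch (Python 3 spelling of the import; the original targets Python 2, where it is simply `import urllib` and `urllib.unquote`):

```python
from urllib.parse import unquote

# Percent-escapes decode back to the UTF-8 bytes they encode:
# %C3%BC is the two-byte UTF-8 sequence for 'ü'.
print(unquote('D%C3%BCsseldorf'))  # → Düsseldorf
```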

And finally we use a custom item exporter (from all the way back in Oct 2011)

# -*- coding: utf-8 -*-
import json
from scrapy.exporters import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')

and add

FEED_EXPORTERS = {
    'json': 'myproject.exporters.UnicodeJsonLinesItemExporter',
}

to myproject/settings.py.

Now we run

~/myproject> scrapy crawl dorf -o dorf.json -t json

we get

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 25: ordinal not in range(128)
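The failure can be reproduced in isolation: u'\xfc' is 'ü', which the ascii codec cannot represent, while UTF-8 encodes it in two bytes:

```python
# u'\xfc' is 'ü'; ascii has no code points above 127
try:
    u'\xfc'.encode('ascii')
except UnicodeEncodeError as exc:
    print('ascii fails:', exc)

print(u'\xfc'.encode('utf8'))  # → b'\xc3\xbc'
```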

ATTEMPT 2

Another solution (the candidate solution for Scrapy 1.2?) is to use the spider

# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'http://en.sistercity.info/']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = response.xpath('//title')[0].extract()
        return item

and the custom item exporter

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)

with

FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}

in myproject/settings.py.

We get the following JSON file.

[
{"title": "<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"}
]
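The \uXXXX escapes are the stdlib json default (ensure_ascii=True). The exporter above passes ensure_ascii=False precisely to disable this, so escaped output suggests the custom exporter was never actually picked up. The intended difference, in isolation:

```python
import json

data = {'title': u'Sister cities of D\xfcsseldorf'}
print(json.dumps(data))                      # escaped:  "D\u00fcsseldorf"
print(json.dumps(data, ensure_ascii=False))  # literal:  "Düsseldorf"
```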

The Unicode is not UTF-8 encoded. Although this is a trivial problem for a couple of characters, it becomes a serious issue if the entire output is in a foreign language.

How can I get an output in UTF-8 unicode?

SOLUTION

Please try this on your Attempt 1 and let me know if it works (I've tested it without setting all those env variables):

# -*- coding: utf-8 -*-
import urllib

import scrapy

# SimpleItem stands in for the item class from the question
# (an Item with `title` and `url` fields)


def to_write(uni_str):
    return urllib.unquote(uni_str.encode('utf8')).decode('utf8')


class CitiesSpider(scrapy.Spider):
    name = "cities"
    allowed_domains = ["sitercity.info"]
    start_urls = (
        'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        for i in range(2):
            item = SimpleItem()
            item['title'] = to_write(response.xpath('//title').extract_first())
            item['url'] = to_write(response.url)
            yield item

The range(2) is just for exercising the JSON exporter. To get a list of dicts instead of JSON lines, you can do this:

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder

class UnicodeJsonLinesItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(ensure_ascii=False, **kwargs)
        self.first_item = True
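The net effect of this exporter can be sketched with the stdlib encoder alone (ScrapyJSONEncoder extends json.JSONEncoder to handle Scrapy-specific objects, so plain JSONEncoder stands in for it here):

```python
import io
import json

encoder = json.JSONEncoder(ensure_ascii=False)
buf = io.StringIO()
# One JSON object per line, non-ASCII characters left intact
for item in [{'title': u'D\xfcsseldorf'}, {'title': u'M\xfcnchen'}]:
    buf.write(encoder.encode(item) + u'\n')
print(buf.getvalue())
```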
