How can I get an output in UTF-8 encoded unicode from Scrapy?
Bear with me. I'm writing every detail because so many parts of the toolchain do not handle Unicode gracefully and it's not clear what is failing.
PRELUDE
We first set up and use a recent Scrapy.
source ~/.scrapy_1.1.2/bin/activate
Since the terminal's default encoding is ASCII rather than UTF-8, we set:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
Also, since Python uses ASCII by default, we set its I/O encoding:
export PYTHONIOENCODING="utf_8"
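To confirm that these variables are actually picked up, a quick check along the following lines may help. It is only a sketch: the helper name describe_encodings is made up for illustration, and the values it reports depend on the shell it is run from.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python reports for stdout, the locale and str defaults."""
    return {
        # Follows PYTHONIOENCODING when that variable is set.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Follows LC_ALL / LANG.
        "locale": locale.getpreferredencoding(False),
        # Interpreter-wide default used for implicit conversions.
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_encodings().items()):
        print("%s: %s" % (name, value))
```

If "stdout" or "locale" still reports an ASCII encoding after the exports above, the variables are probably not reaching the Python process.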
Now we're ready to start a Scrapy project.
scrapy startproject myproject
cd myproject
scrapy genspider dorf PLACEHOLDER
We're told we now have a spider.
Created spider 'dorf' using template 'basic' in module:
myproject.spiders.dorf
We modify myproject/items.py to be:
# -*- coding: utf-8 -*-
import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()
ATTEMPT 1
Now we write the spider, relying on urllib.unquote
# -*- coding: utf-8 -*-
import scrapy
import urllib
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'http://en.sistercity.info/']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = urllib.unquote(
            response.xpath('//title').extract_first().encode('ascii')
        ).decode('utf8')
        return item
And finally we use a custom item exporter (from all the way back in Oct 2011)
# -*- coding: utf-8 -*-
import json
from scrapy.exporters import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')
and add
FEED_EXPORTERS = {
    'json': 'myproject.exporters.UnicodeJsonLinesItemExporter',
}
to myproject/settings.py.
Now we run
~/myproject> scrapy crawl dorf -o dorf.json -t json
we get
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 25: ordinal not in range(128)
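The traceback is not shown, so the following is an inference rather than a diagnosis: position 25 is exactly where the "ü" sits in the page title, which suggests the error is raised by the .encode('ascii') call in the spider and not by the exporter. A minimal sketch that reproduces the position, using the title text from the page (the helper name first_non_ascii is hypothetical):

```python
# -*- coding: utf-8 -*-
# Title text as returned by response.xpath('//title').extract_first()
title = u"<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"


def first_non_ascii(text):
    """Return (index, character) of the first character ASCII cannot encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start is the same position that appears in the error message.
        return exc.start, text[exc.start]
    return None


print(first_non_ascii(title))
```

The index it reports is 25, matching the error message above.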
ATTEMPT 2
Another solution (the candidate solution for Scrapy 1.2?) is to use the spider
# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'http://en.sistercity.info/']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = response.xpath('//title')[0].extract()
        return item
and the custom item exporter
# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)
with
FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
in myproject/settings.py.
We get the following JSON file.
[
{"title": "<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"}
]
The non-ASCII characters are written as \uXXXX escape sequences instead of as UTF-8 encoded characters. Although this is a trivial problem for a couple of characters, it becomes a serious issue if the entire output is in a non-English language.
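For reference, the \uXXXX form is still valid JSON and decodes to the same text; it is simply what the standard json module produces when ensure_ascii is left at its default of True. A small sketch with the standard library only (the sample item is made up from the title above):

```python
import json

item = {"title": u"D\u00fcsseldorf \u2014 sistercity.info"}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(item)

# ensure_ascii=False keeps non-ASCII characters as-is in the resulting string.
literal = json.dumps(item, ensure_ascii=False)

# Both forms decode back to the same data.
assert json.loads(escaped) == json.loads(literal) == item

# The literal form still has to be encoded to bytes before it is UTF-8 on disk.
utf8_bytes = literal.encode("utf-8")
```

So the escaped file is not corrupt, only hard to read by eye.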
How can I get an output in UTF-8 unicode?
Please try this with your Attempt 1 and let me know if it works (I've tested it without setting any of those environment variables):
# -*- coding: utf-8 -*-
import urllib
import scrapy

# SimpleItem: a scrapy.Item with 'title' and 'url' fields (definition not shown)

def to_write(uni_str):
    # Encode to UTF-8 bytes, undo any percent-escapes, decode back to unicode
    return urllib.unquote(uni_str.encode('utf8')).decode('utf8')

class CitiesSpider(scrapy.Spider):
    name = "cities"
    allowed_domains = ["sistercity.info"]
    start_urls = (
        'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        for i in range(2):
            item = SimpleItem()
            item['title'] = to_write(response.xpath('//title').extract_first())
            item['url'] = to_write(response.url)
            yield item
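For readers on Python 3, where urllib.unquote has moved to urllib.parse.unquote and works on text directly, the same helper could be sketched as below. This is an adaptation, not the code from the answer, and it assumes (the answer does not say so) that response.url arrives percent-encoded, for example as D%C3%BCsseldorf.

```python
from urllib.parse import unquote


def to_write(uni_str):
    """Undo percent-escapes, interpreting the escaped bytes as UTF-8."""
    return unquote(uni_str, encoding="utf-8")


# Assumed shape of a percent-encoded response.url for the page above.
quoted_url = "http://en.sistercity.info/sister-cities/D%C3%BCsseldorf.html"
print(to_write(quoted_url))
```

Text that contains no percent-escapes, such as the extracted title, passes through unchanged.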
The range(2) loop in parse is only there to test the JSON exporter with more than one item. To get a list of dicts, you can use this exporter instead:
# -*- coding: utf-8 -*-
# scrapy.contrib.exporter is the deprecated pre-1.0 path; scrapy.exporters replaces it
from scrapy.exporters import JsonItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder

class UnicodeJsonLinesItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(ensure_ascii=False, **kwargs)
        self.first_item = True
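What an exporter like this has to do can be illustrated without Scrapy: serialize with ensure_ascii=False, then encode the result to UTF-8 bytes before writing to the (binary) feed file. The sketch below uses only the standard library; write_json_lines and the in-memory buffer are stand-ins for illustration, not Scrapy APIs.

```python
import io
import json


def write_json_lines(items, binary_file):
    """Write one JSON object per line, as UTF-8 bytes with no \\uXXXX escapes."""
    for item in items:
        line = json.dumps(item, ensure_ascii=False) + u"\n"
        # Explicit encoding avoids any implicit ASCII conversion on write.
        binary_file.write(line.encode("utf-8"))


buf = io.BytesIO()  # stands in for the feed file opened in binary mode
write_json_lines([{"title": u"D\u00fcsseldorf"}], buf)
print(buf.getvalue())
```

As far as I know, Scrapy 1.2 and later also provide a FEED_EXPORT_ENCODING setting; setting it to 'utf-8' in settings.py should give the same result without a custom exporter.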