Scrapy text encoding
Problem description
Here is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')), 'parse_start_url', follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit
My problem is that the returned strings are unicode and I want to encode them to UTF-8. I don't know the best way to do this. I tried several approaches without success.
Thanks in advance!
Recommended answer
Scrapy returns strings as unicode, not ASCII. To encode all strings to UTF-8, you can write:
vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]
But I think you expect a different result. Your code returns a single item containing all the search results. To return one item per result:
hxs = HtmlXPathSelector(response)
for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                             hxs.select("//div[@class='results_address_class']/text()").extract()):
    vriskoit = VriskoItem()
    vriskoit['eponimia'] = eponimia.encode('utf-8')
    vriskoit['address'] = address.encode('utf-8')
    yield vriskoit
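The pairing logic above can be tried without Scrapy. The following is a minimal sketch, assuming Python 3 (where str is already unicode and encode returns bytes); build_items and the two sample lists are hypothetical stand-ins for the results of the two extract() calls, not part of the original spider.

```python
def build_items(names, addresses):
    """Pair each name with its address and encode both as UTF-8 bytes."""
    items = []
    # zip stops at the shorter list, so an unmatched name or address is dropped
    for name, address in zip(names, addresses):
        items.append({
            "eponimia": name.encode("utf-8"),
            "address": address.encode("utf-8"),
        })
    return items


# Hypothetical sample data standing in for the extracted text nodes
names = ["\u0393\u03b9\u03b1\u03c4\u03c1\u03cc\u03c2 A", "\u0393\u03b9\u03b1\u03c4\u03c1\u03cc\u03c2 B"]
addresses = ["\u039a\u03bf\u03c1\u03b4\u03b5\u03bb\u03b9\u03cc 1", "\u039a\u03bf\u03c1\u03b4\u03b5\u03bb\u03b9\u03cc 2"]

items = build_items(names, addresses)
for item in items:
    print(item)
```

Note that zip pairs by position only. If a result on the page has a name but no address, later pairs will be misaligned, so selecting per result block may be more robust.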
UPDATE
By default, the JSON exporter writes unicode symbols escaped (e.g. \u03a4), because not all streams can handle unicode. It has an option to write them as unicode, ensure_ascii=False (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.
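The effect of ensure_ascii can be seen with json.dumps alone. This is a small sketch, assuming Python 3; the item dictionary is made-up sample data.

```python
import json

# Sample item containing the Greek capital letter tau (U+03A4)
item = {"eponimia": "\u03a4"}

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # keeps the character as-is

print(escaped)
print(readable)
```

The first line of output contains the six-character escape sequence, the second contains the actual Greek letter.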
So if you want the exported items to be written in UTF-8 encoding, e.g. to read them in a text editor, you can write a custom item pipeline.
pipelines.py:
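import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        # codecs.open encodes everything written to the file as UTF-8
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters unescaped; one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # close_spider is the hook Scrapy calls on pipelines when the spider finishes
        # (a method named spider_closed would need a manual signal connection)
        self.file.close()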
Don't forget to add this pipeline to settings.py:
ITEM_PIPELINES = ['vrisko.pipelines.JsonWithEncodingPipeline']
You can customize the pipeline to write data in a more human-readable format, e.g. you can generate a formatted report. JsonWithEncodingPipeline is just a basic example.
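To check what ends up on disk, the same write logic can be exercised outside Scrapy. The sketch below assumes Python 3 and uses a hypothetical JsonLinesWriter class that mirrors the pipeline's open, write and close steps; the temporary directory and the sample item are illustrative only.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Hypothetical stand-in for the pipeline: one JSON object per line, UTF-8."""

    def __init__(self, path):
        self.file = open(path, "w", encoding="utf-8")

    def write_item(self, item):
        # Same serialization as process_item in the pipeline
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close(self):
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), "scraped_data_utf8.json")
writer = JsonLinesWriter(path)
writer.write_item({"eponimia": "\u03a4\u03b5\u03c3\u03c4"})
writer.close()

# Read the raw bytes back to confirm they are UTF-8, not \uXXXX escapes
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

Reading the file in binary mode shows the UTF-8 byte sequences directly, which is what a text editor set to UTF-8 will decode.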