scrapy text encoding


Problem description

This is my spider:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    rules = (Rule(SgmlLinkExtractor(allow=(r'\?page=\d')),'parse_start_url',follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit

My problem is that the returned strings are unicode and I want to encode them to utf-8. I don't know which is the best way to do this. I tried several ways without result.

Thanks in advance!

Answer

Scrapy returns strings in unicode, not ascii. To encode all strings to utf-8, you can write:

vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]

But I think you expect a different result. Your code returns a single item containing all the search results. To yield one item per result:

hxs = HtmlXPathSelector(response)
for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                             hxs.select("//div[@class='results_address_class']/text()").extract()):
    vriskoit = VriskoItem()
    vriskoit['eponimia'] = eponimia.encode('utf-8')
    vriskoit['address'] = address.encode('utf-8')
    yield vriskoit


Update

The JSON exporter writes unicode symbols escaped (e.g. \u03a4) by default, because not all streams can handle unicode. It has an option to write them unescaped, ensure_ascii=False (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.
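For reference, a minimal standalone sketch of what ensure_ascii controls (the Greek sample string below is made up for illustration):

import json

data = {'eponimia': u'Γιατρός'}  # made-up Greek sample value

print(json.dumps(data))                      # escaped: {"eponimia": "\u0393\u03b9..."}
print(json.dumps(data, ensure_ascii=False))  # unescaped: {"eponimia": "Γιατρός"}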

So if you want the exported items to be written in utf-8 encoding, e.g. to read them in a text editor, you can write a custom item pipeline.

pipelines.py:

import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        # codecs.open takes care of encoding everything written to the file as utf-8
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ascii characters as-is instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # close_spider is called by Scrapy when the spider finishes
        self.file.close()

Don't forget to add this pipeline to settings.py:

 ITEM_PIPELINES = ['vrisko.pipelines.JsonWithEncodingPipeline']
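Note that newer Scrapy versions expect ITEM_PIPELINES to be a dict mapping the pipeline path to an order number, so there the setting would instead look like:

ITEM_PIPELINES = {
    'vrisko.pipelines.JsonWithEncodingPipeline': 300,
}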

You can customize the pipeline to write data in a more human-readable format, e.g. generate some formatted report. JsonWithEncodingPipeline is just a basic example.
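As a rough sketch of that idea (the report file name and line format below are assumptions, reusing the item fields from the spider above), such a pipeline could look like:

import codecs

class PlainTextReportPipeline(object):
    # Hypothetical example: writes one human-readable line per scraped item.

    def open_spider(self, spider):
        self.file = codecs.open('report.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Assumes the VriskoItem fields used above.
        self.file.write(u'%s - %s\n' % (item['eponimia'], item['address']))
        return item

    def close_spider(self, spider):
        self.file.close()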

