Scrapy: extract text with special characters


Problem description


I'm using Scrapy to extract text from some Spanish websites. The text is written in Spanish, so some words contain special characters like 'ñ' or 'í'. My problem is that when I run

scrapy crawl econoticia -o prueba.json

on the command line to get a file with the scraped data, some characters are not shown properly. For example, this is the original text:

"La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos"

and this is the scraped text:

"La exministra, procesada como part\u00edcipe a titulo lucrativo, intenta burlar a los fot\u00f3grafos"

I want the returned JSON to contain the special characters. I presume my spider code needs something extra to produce the JSON the right way. This is my spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from pais.items import PaisItem


class NoticiaSpider(scrapy.Spider):
    name = "noticia"
    allowed_domains = ["elpais.com"]
    start_urls = (
        ...
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = PaisItem()
        # XPath selectors for the subtitle and headline text nodes
        item['subtitulo'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()
        item['titular'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()
        return item

Answer


Maybe you should add .encode('utf8') after extract().
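
Applied to the parse method above, a minimal sketch of that suggestion might look like this (extract() returns a list of unicode strings, so each one is encoded individually):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = PaisItem()
        # Per the suggestion above: encode each extracted
        # unicode string to UTF-8 bytes before storing it.
        item['subtitulo'] = [s.encode('utf8') for s in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()]
        item['titular'] = [t.encode('utf8') for t in hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()]
        return item

Alternatively, in Scrapy 1.2 and later you can set FEED_EXPORT_ENCODING = 'utf-8' in settings.py, which makes the JSON feed exporter write UTF-8 characters directly instead of ASCII \uXXXX escape sequences.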

