Scrapy: extract text with special characters
Problem description
I'm using Scrapy to extract text from some Spanish websites. Obviously, the text is written in Spanish, and some words have special characters like 'ñ' or 'í'. My problem is that when I run this on the command line to get the file with the scraped data:

scrapy crawl econoticia -o prueba.json

some characters are not shown properly. For example, this is the original text:

"La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos"

And this is the scraped text:

"La exministra, procesada como part\u00edcipe a titulo lucrativo, intenta burlar a los fot\u00f3grafos"

I want the JSON output to contain the special characters themselves. I presume that my spider code needs something to produce the JSON correctly. This is my spider code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from pais.items import PaisItem


class NoticiaSpider(scrapy.Spider):
    name = "noticia"
    allowed_domains = ["elpais.com"]
    start_urls = (...
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = PaisItem()
        item['subtitulo'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[1]/span/text()').extract()
        item['titular'] = hxs.select('//*[@id="merc"]/div[2]/div[4]/div[1]/div[3]/div[2]/div[1]/h1/a/text()').extract()
        return item
Recommended answer
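Maybe you should add .encode('utf8') after extract(). Note that extract() returns a list of strings, so the call has to be applied to each string in the list rather than to the list itself.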
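
It may also help to see that the \u00ed and \u00f3 sequences in the output are ordinary JSON escapes, not corrupted text: any JSON parser turns them back into í and ó. The sketch below uses only Python's standard json module (Python 3 assumed, which is newer than the code in the question) to show that the escaped and unescaped forms carry the same data. The item dictionary is a hypothetical stand-in for the scraped item, not the asker's PaisItem.

```python
import json

# Hypothetical stand-in for the scraped item; extract() yields a list of strings.
item = {
    "titular": [
        "La exministra, procesada como partícipe a titulo lucrativo, "
        "intenta burlar a los fotógrafos"
    ]
}

# Default behaviour (ensure_ascii=True): non-ASCII characters are written
# as \uXXXX escapes, which is what appears in prueba.json.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written literally.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to exactly the same data.
assert json.loads(escaped) == json.loads(readable) == item
```

If the goal is only a human-readable file, Scrapy 1.2 and later provide the FEED_EXPORT_ENCODING setting; putting FEED_EXPORT_ENCODING = 'utf-8' in the project's settings.py should make the JSON feed write the characters literally. Whether that setting is available depends on the installed Scrapy version, so treat it as something to check rather than a guaranteed fix.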