How can I avoid JSON percent-encoding and \u-escaping?


Question

When I parse the file

<html>
    <head><meta charset="UTF-8"></head>
    <body><a href="Düsseldorf.html">Düsseldorf</a></body>
</html>

using

item = SimpleItem()
item['name'] = response.xpath('//a/text()')[0].extract()
item["url"] = response.xpath('//a/@href')[0].extract()
return item

I end up with either \u escapes

[{
    "name": "D\u00fcsseldorf",
    "url": "D\u00fcsseldorf.html"
}]

or with percent-encoded strings

D%C3%BCsseldorf

A custom item exporter (UnicodeJsonLinesItemExporter, which builds its JSON encoder with ensure_ascii=False), along with the appropriate feed exporter setting

FEED_EXPORTERS = {
    'json': 'myproj.exporter.UnicodeJsonLinesItemExporter',
}

does not help.

How do I get UTF-8-encoded JSON output?

I am restating and expanding on an earlier question that went unanswered.

UPDATE:

Orthogonal to Scrapy, note that without setting

export PYTHONIOENCODING="utf_8"

running

> echo { \"name\": \"Düsseldorf\", \"url\": \"Düsseldorf.html\" } > dorf.json
> python -c'import fileinput, json;print json.dumps(json.loads("".join(fileinput.input())),sort_keys=True, indent=4, ensure_ascii=False)' dorf.json > dorf_pp.json

will fail with

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)
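The traceback above is what happens when text containing "ü" is pushed through an output stream that encodes as ASCII, which is the encoding PYTHONIOENCODING overrides for stdout. The following is a minimal sketch of that mechanism only, written for Python 3 as an assumption (the commands above are Python 2); the helper name `write_with_encoding` is hypothetical and not part of the original post.

```python
import io

text = u'{"name": "Düsseldorf"}'


def write_with_encoding(text, encoding):
    """Write text through a text wrapper with the given encoding; return the bytes."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding=encoding)
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # keep the wrapper from closing the buffer on cleanup
    return data


# An ASCII-encoded stream cannot represent "ü" and raises UnicodeEncodeError.
try:
    write_with_encoding(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A UTF-8-encoded stream accepts the same text without complaint.
utf8_bytes = write_with_encoding(text, "utf-8")
```

Setting PYTHONIOENCODING="utf_8" has the effect of giving stdout the second kind of wrapper instead of the first.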

UPDATE

As posted, my question was unanswerable. The UnicodeJsonLinesItemExporter works, but another part of the pipeline was the culprit: as a post-processing step to pretty-print the JSON output, I was using python -m json.tool in.json > out.json.
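The json module escapes non-ASCII characters by default, so a pretty-printing step built on those defaults reintroduces the \u escapes even when the exporter wrote plain UTF-8. Below is a sketch of a pretty-printer that keeps the characters intact, assuming Python 3 and UTF-8 input; the function name `pretty_print_json` and the temporary file paths are hypothetical.

```python
import io
import json
import os
import tempfile


def pretty_print_json(src_path, dst_path):
    """Re-indent a JSON file while keeping non-ASCII characters unescaped."""
    with io.open(src_path, encoding="utf-8") as src:
        data = json.load(src)
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(json.dumps(data, sort_keys=True, indent=4, ensure_ascii=False))
        dst.write(u"\n")


# With default settings, json.dumps escapes "ü" as \u00fc.
escaped = json.dumps({"name": "Düsseldorf"})

# Round-trip a small escaped file through the pretty-printer.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.json")
dst_path = os.path.join(workdir, "out.json")
with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(u'[{"name": "D\\u00fcsseldorf", "url": "D\\u00fcsseldorf.html"}]')
pretty_print_json(src_path, dst_path)
with io.open(dst_path, encoding="utf-8") as f:
    pretty = f.read()
```

On Python 3.9 and later, python -m json.tool --no-ensure-ascii in.json out.json should give a similar result without a custom script.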

Recommended answer

This seems to work for me

# -*- coding: utf-8 -*-
import scrapy
import urllib

class SimpleItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

class CitiesSpider(scrapy.Spider):
    name = "cities"
    allowed_domains = ["sistercity.info"]
    start_urls = (
        'http://en.sistercity.info/countries/de.html',
    )

    def parse(self, response):
        for a in response.css('a'):
            item = SimpleItem()
            item['name'] = a.css('::text').extract_first()
            item['url'] = urllib.unquote(
                a.css('::attr(href)').extract_first().encode('ascii')
                ).decode('utf8')
            yield item
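The spider above undoes percent-encoding with Python 2's urllib.unquote. For readers on Python 3, the equivalent call lives in urllib.parse and decodes %XX sequences as UTF-8 by default. A minimal sketch, assuming Python 3; the helper name `decode_href` is hypothetical, and the None check is an added guard for anchors that have no href attribute.

```python
from urllib.parse import quote, unquote


def decode_href(href):
    """Turn a percent-encoded href into readable text; pass None through."""
    if href is None:
        return None
    return unquote(href)  # %XX sequences are decoded as UTF-8 by default


decoded = decode_href("D%C3%BCsseldorf.html")

# quote() goes the other way and shows where the percent-encoded form comes from.
encoded = quote("Düsseldorf.html")
```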

With the feed exporter cited in your question, the spider also worked when I used a custom feed storage:

# -*- coding: utf-8 -*-
import json
import io
import os
from scrapy.contrib.exporter import BaseItemExporter
from w3lib.url import file_uri_to_path

class CustomFileFeedStorage(object):

    def __init__(self, uri):
        self.path = file_uri_to_path(uri)

    def open(self, spider):
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        return io.open(self.path, mode='ab')

    def store(self, file):
        file.close()

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')
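The part of the exporter that matters for this question is the encoder built with ensure_ascii=False. The sketch below isolates that idea without Scrapy, assuming Python 3 and a binary output stream; the class `Utf8JsonLinesWriter` is a hypothetical stand-in for illustration, not Scrapy's API.

```python
import io
import json


class Utf8JsonLinesWriter(object):
    """Hypothetical stand-in for the exporter: one JSON object per line, UTF-8."""

    def __init__(self, stream):
        self.stream = stream  # binary file-like object
        # ensure_ascii=False keeps "ü" as a character instead of \u00fc.
        self.encoder = json.JSONEncoder(ensure_ascii=False)

    def write_item(self, item):
        line = self.encoder.encode(dict(item)) + "\n"
        # Encode explicitly so the result does not depend on a default codec.
        self.stream.write(line.encode("utf-8"))


out = io.BytesIO()
writer = Utf8JsonLinesWriter(out)
writer.write_item({"name": "Düsseldorf", "url": "Düsseldorf.html"})
written = out.getvalue().decode("utf-8")
```

Encoding to UTF-8 at the point of writing is a design choice made for this sketch: it keeps the output independent of the locale or of how the file was opened.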

The settings are as follows (uncomment the FEED_STORAGES lines if you want the custom storage):

FEED_EXPORTERS = {
    'json': 'myproj.exporter.UnicodeJsonLinesItemExporter'
}
#FEED_STORAGES = {
#   '': 'myproj.exporter.CustomFileFeedStorage'
#}
FEED_FORMAT = 'json'
FEED_URI = "out.json"
