ValueError: Invalid \escape: when reading JSON as a response in Scrapy


Problem description

During parsing I get text responses with JSON in them. They all look very much alike. Some of them parse without any errors, but others throw the error shown below.

I tried replace('\r\n', '') and passing strict=False to json.loads, to no avail.

Here is the URL I get the JSON from: enter link description here. Here is my code (line 51 is the data = json.loads call).

Also, when I try this URL in scrapy shell, the response comes back empty and a different error is raised: no JSON document located. I do not know whether this is important.

def parse_jsn(self, response):
        #inspect_response(response, self)

        data = json.loads(response.body_as_unicode())
        item = response.meta['item']
        item['text']= data[0]['bodyfull']
        yield item

Here is the error.

ValueError: Invalid \escape: line 4 column 942 (char 945)
2017-03-25 17:21:19 [scrapy.core.scraper] ERROR: Spider error processing <GET
or.com/UserReviewController?a=mobile&r=434622632> (referer: https://www.tripa
w-g60763-d122005-Reviews-or490-The_New_Yorker_A_Wyndham_Hotel-New_York_City_N
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in it
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", l
der_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", l
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py",

    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", lin
    return (r for r in result or () if _filter(r))
  File "C:\Code\Active\tripadvisor\tripadvisor\spiders\mtripad.py", line 51,
    data = json.loads(response.body_as_unicode(), strict=False)
  File "c:\python27\lib\json\__init__.py", line 352, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "c:\python27\lib\json\decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\python27\lib\json\decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 4 column 579 (char 582)

Recommended answer

First of all, +1 for scraping the mobile API. Much more clever than scraping from HTML!

Indeed, there is an issue with the encoding. The body contains some octal-encoded characters ([...] \074br/\076\074br/\076Best Regards,\074br/\076Emily [...]) that break the JSON parsing. JSON only allows the escapes \", \\, \/, \b, \f, \n, \r, \t and \uXXXX, so a sequence such as \074 is rejected as an invalid escape. To get rid of them, use:

response.body.decode('unicode-escape')
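To make the failure and the fix concrete, here is a minimal Python 3 sketch. The `raw` payload is a made-up stand-in for `response.body`, not the real TripAdvisor response; it only imitates the octal escapes quoted above.

```python
import json

# Hypothetical payload imitating the failing response. The bytes contain the
# literal sequences \074 and \076 (octal for "<" and ">").
raw = b'[{"bodyfull": "Great stay. \\074br/\\076Best Regards,\\074br/\\076Emily"}]'

# Parsing the text as-is fails, because \0 is not a valid JSON escape.
try:
    json.loads(raw.decode('utf-8'))
    failed = False
except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
    failed = True
    print('plain decode failed:', exc)

# The fix suggested above: let the unicode-escape codec resolve the escapes
# before the text reaches the JSON parser.
decoded = raw.decode('unicode-escape')
data = json.loads(decoded)
print(data[0]['bodyfull'])
```

One thing to check on real data: the unicode-escape codec reads the remaining bytes as Latin-1 and also resolves other backslash escapes such as \" and \n, so a body that contains escaped quotes or non-ASCII UTF-8 text may need a more targeted replacement.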

The data also contains some encoded HTML characters: "&#x201c;Nice clean and perfectly average&#x201d;". I suggest unescaping them:

from HTMLParser import HTMLParser  # Python 2 module name
...
json.loads(HTMLParser().unescape(response.body.decode('unicode-escape')))
...

In Python 3:

import html 
...
json.loads(html.unescape(response.body.decode('unicode-escape')))
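As a variation on the one-liner above, the following Python 3 sketch unescapes the HTML entities after parsing instead of before, so that an entity such as &quot; cannot inject a bare quote into the JSON text. The helper names `unescape_strings` and `parse_reviews` and the `sample` payload are hypothetical, chosen only for illustration.

```python
import html
import json


def unescape_strings(value):
    """Recursively apply html.unescape to every string in a parsed JSON value."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_strings(item) for item in value]
    if isinstance(value, dict):
        return {key: unescape_strings(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


def parse_reviews(body):
    """Resolve backslash escapes, parse the JSON, then unescape HTML entities."""
    text = body.decode('unicode-escape')
    return unescape_strings(json.loads(text))


# Hypothetical payload combining both problems described in the answer.
sample = (b'[{"title": "&#x201c;Nice clean and perfectly average&#x201d;", '
          b'"bodyfull": "stay. \\074br/\\076Emily"}]')

reviews = parse_reviews(sample)
print(reviews[0]['title'])
print(reviews[0]['bodyfull'])
```

In a spider callback this would be called as `parse_reviews(response.body)`, assuming the body has the same shape as the sample.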

The result should look like: [{'title': 'Nice clean and perfectly average', 'bodyfull': '[...] stay. <br/><br/>Best Regards,<br/>Emily Rodriguez', [...]}]

As you can see, there are some HTML tags in the result. If you want to remove them, you could use a regex like:

import re
...
p = re.compile(r'<.*?>')  # non-greedy match of anything between < and >
no_html = p.sub('', str_html)
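A small self-contained sketch of the same regex in use. The `strip_tags` helper and its `replacement` parameter are hypothetical additions; replacing each tag with a space and then collapsing whitespace is one possible way to keep words on either side of a `<br/>` from running together.

```python
import re

TAG_RE = re.compile(r'<.*?>')  # same pattern as above


def strip_tags(text, replacement=''):
    """Remove anything that looks like an HTML tag, then tidy the whitespace."""
    without_tags = TAG_RE.sub(replacement, text)
    # Collapse runs of whitespace left behind by the removed tags.
    return ' '.join(without_tags.split())


str_html = 'stay. <br/><br/>Best Regards,<br/>Emily Rodriguez'

print(strip_tags(str_html))        # tags dropped outright
print(strip_tags(str_html, ' '))   # tags turned into word breaks
```

A regex like this is fine for simple markup such as `<br/>`; for arbitrary HTML a real parser is usually the safer choice.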
