Scrapy: POST request returning JSON response (200 OK) but with incomplete data


Problem description

MySpider tries to emulate the load-more click, which dynamically loads more items on the web page. This continues until nothing more is left to load.

yield FormRequest(url,headers=header,formdata={'entity_id': '70431','profile_action': 'review-top','page':str(p), 'limit': '5'},callback=self.parse_review)

header = {#'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0',
               'X-Requested-With': 'XMLHttpRequest',
               'Host': 'www.zomato.com',
               'Accept': '*/*',
               'Referer': 'https://www.zomato.com',
               'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
               'dont_filter':'True' }

url = 'https://www.zomato.com/php/social_load_more.php'

The response received is a JSON response.

jsonresponse = json.load(response)

This is what I see:

('data==', {u'status': u'success', u'left_count': 0, u'html': u"<script type='text/javascript'>if (typeof initiateLaziness == 'function') initiateLaziness() </script>", u'page': u'1', u'more': 0})

As you can see, I get values for status, left_count, page and more. However, I am interested in 'html'. Unfortunately, its value is not the one I receive when the same request is made through the browser (verified by inspecting the network calls).

The expected 'html' is:

<div><a> very long html stuff...............................................<div><script type='text/javascript'>if (typeof initiateLaziness == 'function') initiateLaziness() </script>

I only receive the trailing part:

<script>...................................</script>. 

The real HTML content is missing.
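One way to detect this case in code is to strip the script element from the 'html' field and check whether anything is left. The following is a minimal sketch that assumes the payload shape shown above; the helper name has_real_html is hypothetical and not part of Scrapy or the site's API.

```python
import json
import re

# Matches an entire <script>...</script> element, including its body.
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)


def has_real_html(payload):
    """Return True if the 'html' field holds markup other than <script> blocks."""
    html = payload.get("html", "")
    return bool(SCRIPT_RE.sub("", html).strip())


# Payload shaped like the one the spider received: only a script is present.
incomplete = json.loads(
    '{"status": "success", "left_count": 0, "page": "1", "more": 0, '
    '"html": "<script type=\'text/javascript\'>initiateLaziness()</script>"}'
)
print(has_real_html(incomplete))
```

A check like this could be used in the callback to log or retry when the server returns the stripped-down payload.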

Note that I do receive a response, and only 'html' is incomplete; everything else is fine. I believe it might be related to dynamically generated HTML, but I have no clue about it.

Scrapy's middleware does not add a Content-Length header, and I cannot add one myself: the response fails with 400 when I add it to the headers.

Request headers actually sent to the server:

 {'Accept-Language': ['en'], 'Accept-Encoding': ['gzip, deflate,br'], 'Dont_Filter': ['True'], 'Connection': ['keep-alive'], 'Accept': ['*/*'], 'User-Agent': ['Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0'], 'Host': ['www.zomato.com'], 'X-Requested-With': ['XMLHttpRequest'], 'Cookie': ['zl=en; fbtrack=9be27330646d24088c56c2531ea2fbf5; fbcity=7; PHPSESSID=2338004ce3fd540477242c3eaee685168163bd05'], 'Referer': ['https://www.zomato.com'], 'Content-Type': ['application/x-www-form-urlencoded; charset=UTF-8']})

Can anyone please help me if I am missing anything here? Or is there some way I can send the Content-Length, or make the middleware send it for me? Many thanks.

Recommended answer

You won't get the HTML content in the response because you are not using cookies. The actual request headers you mention contain a Cookie attribute, but the AJAX request sent by your code has no Cookie field.

First, a cookie is set in the response to the request for zomato's restaurant page at https://www.zomato.com/city/restaurant/reviews. Then, when the load-more button is clicked, a request is sent to 'https://www.zomato.com/php/social_load_more.php' with a Cookie field containing the cookie the server set in the previous response. So every time an AJAX request is made, the cookie set in the previous response should be sent in the request headers, and a new cookie will be set in the response to the current request.
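The cookie round trip described here can be reproduced offline with a toy server. The sketch below uses only the Python standard library; the handler, the paths /reviews and /load_more, and the cookie value are invented for the demo and only imitate the behaviour described, they are not zomato's actual implementation.

```python
import http.cookiejar
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy stand-in for the site: GET sets a cookie, POST checks for it."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Set-Cookie", "PHPSESSID=demo; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_POST(self):
        # Consume the form body before answering.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        has_cookie = "PHPSESSID=demo" in self.headers.get("Cookie", "")
        html = "<script></script>"
        if has_cookie:
            html = "<div>reviews</div>" + html
        body = json.dumps({"status": "success", "html": html}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def load_more(opener, base):
    """POST the load-more form and return the 'html' field of the JSON reply."""
    data = urllib.parse.urlencode({"page": "1", "limit": "5"}).encode()
    with opener.open(base + "/load_more", data=data) as resp:
        return json.loads(resp.read().decode())["html"]


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
no_proxy = urllib.request.ProxyHandler({})  # talk to localhost directly

# Without visiting the page first, no cookie is sent with the POST.
plain = urllib.request.build_opener(no_proxy)
without_cookie = load_more(plain, base)

# With a cookie jar, the GET stores the cookie and the POST sends it back.
jar = http.cookiejar.CookieJar()
with_jar = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(jar)
)
with_jar.open(base + "/reviews").close()
with_cookie = load_more(with_jar, base)

server.shutdown()
server.server_close()
print(without_cookie)
print(with_cookie)
```

The first POST gets only the script part, while the second one, sent after the page visit through the cookie-aware opener, gets the full markup.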

To manage these cookies, I used the Session object of the requests package, so the script can also be written without Scrapy. Since your code is written in Scrapy, check whether it offers a comparable way to manage cookies.

My code:

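import requests

# Reviews page of the restaurant; the first cookie is set by this response
url = 'https://www.zomato.com/city/restaurant/reviews'
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"}
# The Session stores cookies from each response and sends them on later requests
s = requests.Session()
resp = s.get(url, headers=header)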

The code above sends a request to the URL of the restaurant reviews. This is essential because the first cookie is set in the response to this request.

# entity_id as used in the question; replace it with the id of the restaurant you need
res_id = '70431'
params = {
    'entity_id': res_id,
    'profile_action': 'reviews-dd',
    'page': '1',
    'limit': '5',
}
header = {"origin":"https://www.zomato.com","Referer":"https://www.zomato.com/","user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0", "x-requested-with":"XMLHttpRequest", 'Accept-Encoding': 'gzip, deflate, br'}
# The session attaches the cookies stored from the earlier GET request
loadreviews_text = s.post("https://www.zomato.com/php/social_load_more.php", data=params, headers=header)
loadreviews = loadreviews_text.json()

Now a request is made to social_load_more.php. The Session object 's' manages the cookies. The variable loadreviews now holds the parsed JSON response, including the html data.
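To keep loading until nothing is left, as the question describes, the POST can be repeated with an increasing page number. The sketch below assumes that the 'more' field of the JSON response is falsy once no further pages exist, which matches the sample response in the question but is not confirmed by the source; collect_pages and fake_fetch are hypothetical names, and the fake fetcher stands in for the real session call.

```python
def collect_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until the server reports no more."""
    chunks = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        chunks.append(data.get("html", ""))
        # Assumption: 'more' is falsy once nothing is left to load.
        if not data.get("more"):
            break
    return chunks


# Fake fetcher standing in for the real POST; three pages, then done.
def fake_fetch(page):
    return {"html": "<div>page %d</div>" % page, "more": 1 if page < 3 else 0}


print(collect_pages(fake_fetch))
```

With the session from the answer, fetch_page could be a small function that posts params with 'page' set to str(page) and returns the .json() of the response. The max_pages cap is a safety net in case the stop condition never triggers.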
