Scraping JSON from AJAX calls

Background

Consider this URL:

base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"

I want to make the AJAX call that returns the telephone number:

ajax_url = "https://www.olx.bg/ajax/misc/contact/phone/7XarI/?pt=e3375d9a134f05bbef9e4ad4f2f6d2f3ad704a55f7955c8e3193a1acde6ca02197caf76ffb56977ce61976790a940332147d11808f5f8d9271015c318a9ae729"

Desired result

If I press the button on the site in Chrome, the browser console shows the result I want:

{"value":"088 *****"}

Debugging

If I open a new tab and paste the ajax_url, I always get a placeholder value:

{"value":"000 000 000"}

If I try something like:

Bash:

wget "$ajax_url"

Python:

import requests

json_response = requests.get(ajax_url)

In both cases I just receive the HTML of the site's error-handling page.

Ideas

The browser must be sending something extra with the request that my scripts are not. What is it? Maybe a cookie?

How do I get the desired result with Bash or Python?

Edit

The status code of the HTML response is 200.

I tried curl as well and got the same HTML error page.

A partial fix

I noticed that if I copy the browser's cookie and make the request with all of the browser's headers, including the cookie, I get the correct result:

# I think the most important header is the cookie.
# DICT_WITH_HEADERS_FROM_BROWSER is a placeholder for the headers copied from the browser.
headers = DICT_WITH_HEADERS_FROM_BROWSER
json_response = requests.get(ajax_url, headers=headers)
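One way to try this by hand is to copy the raw Cookie header value from the browser's developer tools and split it into name/value pairs. The sketch below uses only the standard library. The cookie names and values are made up for illustration and are not the cookies olx.bg actually sets, and cookie_header_to_dict is a hypothetical helper name.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(raw_cookie_header):
    """Turn a raw 'Cookie:' header value into a plain dict of name -> value."""
    jar = SimpleCookie()
    jar.load(raw_cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie string, as it might be pasted from the browser's network tab.
raw = "PHPSESSID=abc123; lang=bg"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict can be passed to requests through the cookies= keyword argument instead of being embedded in the headers dict. Copied cookies expire, so this is only useful for confirming that the cookie is what the endpoint checks.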

Final question

The only remaining question is: how can I obtain that cookie from a Python script instead of copying it from the browser?

Solution

First, create a requests Session to store cookies. Then send an HTTP GET request to the page that actually issues the AJAX request. Any cookie the website sets arrives in the GET response, and the session stores it. You can then use the same session to call the AJAX API.

Important Note 1: On the original website, the AJAX URL is called with an HTTP POST request, so you should not send a GET request to it.

Important Note 2: You must also extract phoneToken from the site's JavaScript, where it is stored in a variable like var phoneToken = 'here is the pt';
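The token extraction from Note 2 can be tried offline before touching the live site. This is a minimal sketch that assumes the page embeds the token as a single-quoted JavaScript string. The HTML fragment and the token value are invented for the example, and extract_phone_token is a hypothetical helper name.

```python
import re

# Hypothetical fragment of the ad page's inline JavaScript.
sample_html = """
<script>
    var phoneToken = 'e3375d9a134f05bb';
</script>
"""


def extract_phone_token(html):
    """Return the phoneToken value, or None if the page does not contain one."""
    # [^']+ stops at the first closing quote, so other quoted strings
    # later on the same line are not swallowed into the match.
    match = re.search(r"phoneToken\s*=\s*'([^']+)'", html)
    return match.group(1) if match else None


token = extract_phone_token(sample_html)
print(token)
```

Returning None rather than indexing into the result of re.findall makes a missing token an explicit case to handle, which helps if the site changes its markup.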

Sample code:

import re
import requests

my_session = requests.Session()

# call html website
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
base_response = my_session.get(url=base_url)
assert base_response.status_code == 200

# extract phone token from base url response
phone_token = re.findall(r'phoneToken\s=\s\'(.+)\';', base_response.text)[0]

# call ajax api
ajax_path = "/ajax/misc/contact/phone/81i3H/?pt=" + phone_token
ajax_url = "https://www.olx.bg" + ajax_path
ajax_headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,fa;q=0.8',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'Referer': 'https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
ajax_response = my_session.post(url=ajax_url, headers=ajax_headers)

print(ajax_response.text)

When you run the code above, the result below is displayed:

{"value":"088 558 9937"}
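To use the number in a script rather than just print it, the response text can be parsed as JSON. The sketch below also treats the two failure modes described in the question as errors: the HTML error page, and the "000 000 000" value returned when the request is not accepted. parse_phone is a hypothetical helper name, and treating "000 000 000" as a placeholder is an assumption based on the question.

```python
import json

# Value the question reports for requests made without a valid session.
PLACEHOLDER = "000 000 000"


def parse_phone(response_text):
    """Return the phone number from the AJAX response, or None on failure."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # The endpoint answered with HTML (the error page) instead of JSON.
        return None
    if not isinstance(payload, dict):
        return None
    value = payload.get("value")
    if value is None or value == PLACEHOLDER:
        return None
    return value


print(parse_phone('{"value":"088 558 9937"}'))
```

In the sample code above, this would be called as parse_phone(ajax_response.text).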
