Scraping JSON from AJAX calls
Problem description

Consider the ad page at base_url (listed under Background). I want to make the AJAX call that returns the seller's telephone number, ajax_url.

If I press the button on the site in Chrome, the console shows the wanted result. If I open a new tab and paste ajax_url, I always get empty values.

If I try the same request from Bash (wget) or Python (requests.get), I only receive the HTML of the site's error-handling page, even though the response status code is 200. I get the same HTML with curl.

The browser must be sending something more with the request. What is it? Maybe a cookie? How do I get the wanted result with Bash or Python?

Edit: I have noticed that if I copy the browser's cookie and make the request with all the browser's headers, including the cookie, I get the correct result.
The only question left is: how can I generate the cookie from a Python script?

First, create a requests Session to store cookies. Then send an HTTP GET request to the page that actually triggers the AJAX request. If the website creates any cookie, it is sent in the GET response and the session stores it. You can then use the same session to call the AJAX API.

Important Note 1: The AJAX URL called by the original website is an HTTP POST request. Do not send a GET request to that URL.

Important Note 2: You must also extract phoneToken from the website's JS code, where it is stored in a variable like the var phoneToken line shown further down.

Running the sample code further down prints the result shown after it.
Background
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
ajax_url = "https://www.olx.bg/ajax/misc/contact/phone/7XarI/?pt=e3375d9a134f05bbef9e4ad4f2f6d2f3ad704a55f7955c8e3193a1acde6ca02197caf76ffb56977ce61976790a940332147d11808f5f8d9271015c318a9ae729"
Wanted results
{"value":"088 *****"}
Debugging
If I open ajax_url in a new browser tab, I always get empty values:
{"value":"000 000 000"}
Bash:
wget "$ajax_url"
Python:
import requests
json_response = requests.get(ajax_url)
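Because the error page also comes back with status 200, the status code alone cannot separate the two cases. A minimal sketch of a body check, assuming a good response is always a JSON object with a "value" key as in the wanted result; the helper name extract_value is hypothetical:

```python
import json


def extract_value(body):
    """Return the "value" field if body is the expected JSON object, else None."""
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if isinstance(data, dict):
        return data.get("value")
    return None


# A JSON body yields its value; an HTML error page yields None.
print(extract_value('{"value":"000 000 000"}'))
print(extract_value("<html><body>error</body></html>"))
```

With requests, the same check would be applied to json_response.text.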
Ideas
Edit
Kind of a fix.
# I think the most important header is the cookie.
# DICT_WITH_HEADERS_FROM_BROWSER is a placeholder for the headers copied from the browser.
headers = DICT_WITH_HEADERS_FROM_BROWSER
json_response = requests.get(ajax_url, headers=headers)
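If the headers are copied from the browser as raw text, the Cookie line can be split into a dict first. This is only a sketch using the standard library; the cookie names and values are made up for illustration and are not the site's real cookies:

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(raw_cookie):
    """Split a raw Cookie header value into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(raw_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie string, not taken from the real site.
cookies = cookie_header_to_dict("PHPSESSID=abc123; lang=bg")
print(cookies)
```

The resulting dict could then be passed through the cookies= argument of requests.get instead of a hand-written Cookie header.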
Final question
var phoneToken = 'here is the pt';
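A line of that shape can be matched with a regular expression. A minimal sketch run against a made-up HTML fragment; the exact spacing and quoting on the real page are assumptions, and find_phone_token is a hypothetical helper name:

```python
import re

# Allow any amount of whitespace around "=" and stop at the closing quote.
TOKEN_RE = re.compile(r"phoneToken\s*=\s*'([^']+)'")


def find_phone_token(html):
    """Return the phoneToken value from page source, or None if it is absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


# Made-up fragment for illustration only.
sample = "<script>var phoneToken = 'abc123'; var other = 'x';</script>"
print(find_phone_token(sample))
```

Returning None instead of indexing into an empty list makes a missing token easier to detect when the page layout changes.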
import re
import requests

my_session = requests.Session()

# Load the ad page first so the session stores any cookies the site sets
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
base_response = my_session.get(url=base_url)
assert base_response.status_code == 200

# Extract the phone token from the page source (stop at the closing quote)
phone_token = re.findall(r"phoneToken\s*=\s*'([^']+)';", base_response.text)[0]

# Call the AJAX API with POST, reusing the cookies stored in the session
ajax_path = "/ajax/misc/contact/phone/81i3H/?pt=" + phone_token
ajax_url = "https://www.olx.bg" + ajax_path
ajax_headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,fa;q=0.8',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'Referer': 'https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
ajax_response = my_session.post(url=ajax_url, headers=ajax_headers)
print(ajax_response.text)
{"value":"088 558 9937"}
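The sample hard-codes the ad ID 81i3H, which also appears at the end of base_url after "ID". Assuming that pattern holds for other ads (the source does not confirm this), the AJAX URL could be derived from the ad URL. build_ajax_url is a hypothetical helper, sketched here without any network access:

```python
import re
from urllib.parse import urlencode, urlsplit


def build_ajax_url(ad_url, phone_token):
    """Build the phone AJAX URL for an ad page URL ending in -ID<id>.html."""
    parts = urlsplit(ad_url)
    # Assumption: the ad ID is the alphanumeric run between "-ID" and ".html".
    match = re.search(r"-ID([A-Za-z0-9]+)\.html$", parts.path)
    if match is None:
        raise ValueError("no ad ID found in " + ad_url)
    query = urlencode({"pt": phone_token})
    return f"{parts.scheme}://{parts.netloc}/ajax/misc/contact/phone/{match.group(1)}/?{query}"


base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
print(build_ajax_url(base_url, "TOKEN"))
```

The returned URL would replace the hand-built ajax_url in the session.post call, with the real token in place of "TOKEN".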