Scraping website (marketchameleon) returns encrypted data
Question
I am learning how to scrape websites with Python; for now I have just been using requests and BeautifulSoup.
I am trying to access the following page: https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates
Yes, you need a subscription to see all the data, but this is just for learning purposes, so the small amount of data visible in the browser should be enough.
Here's how I get the data:
import requests
import urllib.request
from bs4 import BeautifulSoup

headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

url = 'https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates'
# pass the headers by keyword: the second positional argument of requests.get is params
response = requests.get(url, headers=headers_Get)
soup = BeautifulSoup(response.text, "html.parser")
However, the html data returned seems to be encrypted (just an extract as the encrypted part is quite long):
<div class="symov_earnings">
<div class="flex_container_between flex_center_vertical">
<div class="dl-tbl-outer"><div class="dis-prem"><button class="_noprem prem-btn" onclick="site_OpenPremium();">Download Now</button><div class="dis-prem-pop"><p>Premium Feature</p><p><a href="/Account/Login">Login</a><span>|</span><a href="/Subscription/Compare">Subscribe</a></p></div></div></div>
</div>
<div cipherxx="OwA+ADwAOQA+ADwABABEAFcAVgBdAFYAEwBfAFwADQAUAEcASABeAGwAUABNAEQAaQBRAFAAQQBdAF8AVgBXAEUAFgARAFAAXwBXAEsAQwALABYAXABDAGwAWgBRAFcAXgBAAFMAXABBAFIAXQBCABQACgA8ADkAEwAWABgAEAAKAEAAWQBWAFIAUgAGAD0APAAUABEAEwATABYAGAAQABYACABFAEEAEwBVAFQAUQBFAEcADAARAF4AVwBRAF4AaQBcAFQAUgBXAF8AVgBXABQACgA8ADkAEwAWABgAEAAWABQAEQATABMAFgAYABAACgBAAFkAEwBQAFkAVABDAEYAVQBfAA4AEQAOABoADgBjAEQAUgBcAF4AXwBWAFcAFgBxAFAAQQBdAF8AVgBXAEUACAAeAEcAWwAIADUAOgAWABQAEQATABMAFgAYABAACgAbAEUAQQANA
Is there any way to find out what is happening (how the site is protected from scrapers?) and to get the actual html data?
Thanks
Recommended answer
The data are indeed encrypted. If you look at the JS files that are part of the website, you can spot this particular file, which contains the function used to decrypt the data. All of this is done in JavaScript, so you have 2 options here:
- use BeautifulSoup to scrape the page and recode the JavaScript decryption function in Python
- use a headless browser such as Selenium
Using the first option (recoding the decryption function in Python), here is how you could do that:
import requests
from bs4 import BeautifulSoup
import base64
import json

url = "https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates"

# a session keeps the cookies set by the server, including the key
session = requests.Session()
r = session.get(url)
soup = BeautifulSoup(r.text, "html.parser")

key = session.cookies.get_dict()["v1"]

# collect the base64 payload of every div that has a cipherxx attribute
encryptedDivs = [ i["cipherxx"] for i in soup.find_all("div") if i.get("cipherxx")]

unencrypted = []
for div in encryptedDivs:
    encryptedData = base64.b64decode(div)
    # each character is stored on 2 bytes: keep only the first byte of each pair
    cipher = "".join([
        chr(encryptedData[i])
        for i in range(0,len(encryptedData),2)
    ])
    # XOR every character with the key, repeating the key as needed
    data = ""
    for i in range(0, len(cipher)):
        c_num = ord(cipher[i])
        k_num = ord(key[i % len(key)])
        c2 = c_num ^ k_num
        data += chr(c2)
    unencrypted.append(data)

# unencrypted[0] is the header div with some info about stock price etc...
# unencrypted[1] is the first table
# lets parse the second table unencrypted[2]
soup = BeautifulSoup(unencrypted[2], "html.parser")
tbody = soup.find("tbody").findAll("tr", recursive=False)
thead = soup.find("thead").findAll("tr", recursive=False)

table2 = [
    {
        "Date": t[0].text.strip(),
        "Time": t[1].text.strip(),
        "Period": t[2].text.strip(),
        "Conference Call": t[3].text.strip(),
        "Price Effect" : t[4].find("span").text if t[4].find("span") else t[4].text.strip(),
        "Implied Straddle": t[5].text.strip(),
        "Closing Price": t[6].text.strip(),
        "Opening Gap": t[7].text.strip(),
        "Drift Since": t[8].text.strip(),
        "Range Since": t[9].text.strip(),
        "Price Change 1 Week Before":t[10].text.strip(),
        "Price Change 1 Week After": t[11].text.strip()
    }
    for t in (t.findAll('td', recursive=False) for t in tbody)
    if len(t) >= 12  # indices 0..11 are read above, so 12 cells are required
]

print(json.dumps(table2, indent=4, sort_keys=True))
Note that the encryption key is located in the cookie named v1 (which is why you need requests.Session()).
This is XOR encryption. It XORs each value of the data with a key (in this case the key is stored in a cookie). To decrypt, you just need to XOR the cipher with the same key to get the original data back.
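The reason a single routine works in both directions is that XOR is its own inverse: applying the same key twice gives back the starting value. Here is a minimal sketch of that property. The name `xor_with_key` is a hypothetical helper chosen for illustration, not something taken from the site's JavaScript.

```python
from itertools import cycle


def xor_with_key(text: str, key: str) -> str:
    """XOR each character of text with the key, repeating the key as needed."""
    return "".join(chr(ord(c) ^ ord(k)) for c, k in zip(text, cycle(key)))


key = "97523022"
cipher = xor_with_key("HELLO", key)   # "encrypt"
plain = xor_with_key(cipher, key)     # the same call "decrypts"

print([ord(c) for c in cipher])  # [113, 114, 121, 126, 124]
print(plain)                     # HELLO
```

Because the key is simply repeated over the data, anyone who can read the key (here, from a cookie) can undo the encryption.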
The easiest way to explain it is with an example:
- the data is the string "HELLO"
- the key is the string "97523022"
data       "H"       "E"       "L"       "L"       "O"
decimal    72        69        76        76        79
binary     01001000  01000101  01001100  01001100  01001111

key        "9"       "7"       "5"       "2"       "3"
decimal    57        55        53        50        51
binary     00111001  00110111  00110101  00110010  00110011

           01001000  01000101  01001100  01001100  01001111
XOR        00111001  00110111  00110101  00110010  00110011
==>        01110001  01110010  01111001  01111110  01111100
decimal    113       114       121       126       124
HEX        \x71      \x72      \x79      \x7E      \x7C

complete each value with a 0 byte:
HEX        \x71\x00  \x72\x00  \x79\x00  \x7E\x00  \x7C\x00

encode \x71\x00\x72\x00\x79\x00\x7E\x00\x7C\x00 to base64,
which gives: 'cQByAHkAfgB8AA=='
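For reference, the encoding direction of this example can be sketched in Python as follows. The `encrypt` function is a hypothetical name used only for illustration. Using UTF-16LE to produce the zero-padded bytes is an assumption based on the padding seen in the example, not something confirmed from the site's code.

```python
import base64


def encrypt(plain: str, key: str) -> str:
    """XOR plain with the repeating key, then base64-encode the 2-byte units."""
    xored = "".join(
        chr(ord(c) ^ ord(key[i % len(key)]))
        for i, c in enumerate(plain)
    )
    # Assumption: UTF-16LE writes each of these characters as its value
    # followed by a zero byte, which matches the padding in the example.
    return base64.b64encode(xored.encode("utf-16-le")).decode("ascii")


print(encrypt("HELLO", "97523022"))  # cQByAHkAfgB8AA==
```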
Try this code to decrypt it (the same logic as the decryption code above):
import base64

key = "97523022"
payload = "cQByAHkAfgB8AA=="

data = base64.b64decode(payload)
# keep only the first byte of each 2-byte pair
cipher = "".join([
    chr(data[i])
    for i in range(0,len(data),2)
])
# XOR every character with the key, repeating the key as needed
data = ""
for i in range(0, len(cipher)):
    c_num = ord(cipher[i])
    k_num = ord(key[i % len(key)])
    c2 = c_num ^ k_num
    data += chr(c2)
print(data)
Output:
HELLO
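If the decryption is needed in more than one place, the same steps can be wrapped in a small function. This is only a sketch: `decrypt` is a hypothetical name, and like the loop above it keeps only the first byte of each 2-byte pair, so it assumes every decrypted character fits in one byte.

```python
import base64
from itertools import cycle


def decrypt(payload: str, key: str) -> str:
    """Base64-decode payload, drop the padding bytes, XOR with the repeating key."""
    raw = base64.b64decode(payload)
    low_bytes = raw[::2]  # first byte of each 2-byte pair
    return "".join(chr(b ^ ord(k)) for b, k in zip(low_bytes, cycle(key)))


print(decrypt("cQByAHkAfgB8AA==", "97523022"))  # HELLO
```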
You can also check this link and this wiki if you are interested.