Scraping website (marketchameleon) returns encrypted data


Question

I am learning how to scrape websites with Python; for now I have only been using requests and BeautifulSoup.

I am trying to access the following page: https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates

Yes, you need a subscription to see all of the data, but this is only for learning purposes, so the small amount of data visible in the browser should be enough.

Here is how I get the data:

import requests
from bs4 import BeautifulSoup
headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
url = 'https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates'
# pass the headers by keyword; the second positional argument of requests.get is params
response = requests.get(url, headers=headers_Get)
soup = BeautifulSoup(response.text, "html.parser")

However, the HTML returned seems to be encrypted (only an excerpt is shown, since the encrypted part is quite long):

<div class="symov_earnings">
<div class="flex_container_between flex_center_vertical">
<div class="dl-tbl-outer"><div class="dis-prem"><button class="_noprem prem-btn" onclick="site_OpenPremium();">Download Now</button><div class="dis-prem-pop"><p>Premium Feature</p><p><a href="/Account/Login">Login</a><span>|</span><a href="/Subscription/Compare">Subscribe</a></p></div></div></div>
</div>
<div cipherxx="OwA+ADwAOQA+ADwABABEAFcAVgBdAFYAEwBfAFwADQAUAEcASABeAGwAUABNAEQAaQBRAFAAQQBdAF8AVgBXAEUAFgARAFAAXwBXAEsAQwALABYAXABDAGwAWgBRAFcAXgBAAFMAXABBAFIAXQBCABQACgA8ADkAEwAWABgAEAAKAEAAWQBWAFIAUgAGAD0APAAUABEAEwATABYAGAAQABYACABFAEEAEwBVAFQAUQBFAEcADAARAF4AVwBRAF4AaQBcAFQAUgBXAF8AVgBXABQACgA8ADkAEwAWABgAEAAWABQAEQATABMAFgAYABAACgBAAFkAEwBQAFkAVABDAEYAVQBfAA4AEQAOABoADgBjAEQAUgBcAF4AXwBWAFcAFgBxAFAAQQBdAF8AVgBXAEUACAAeAEcAWwAIADUAOgAWABQAEQATABMAFgAYABAACgAbAEUAQQANA

Is there any way to find out what is happening (how is the site protected from scrapers?) and to get the actual HTML data?

Thanks

Recommended answer

The data is indeed encrypted. If you look at the JS files that are part of the website, you can spot one particular file that contains the function used to decrypt the data. All of this is done in JavaScript, so you have two options here:

  • use BeautifulSoup to scrape the page and re-implement the JavaScript decryption function in Python
  • use a headless browser such as Selenium
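Before choosing either option, a quick way to see what the cipherxx attribute holds is to base64-decode a short prefix of it. The sketch below uses the first 32 characters of the excerpt shown in the question (the variable names are illustrative only) and prints the raw bytes. For this prefix every second byte is zero, which suggests that each character is stored as two bytes and only the first byte of each pair carries data.

```python
import base64

# First 32 characters of the cipherxx attribute shown in the question.
prefix = "OwA+ADwAOQA+ADwABABEAFcAVgBdAFYA"

raw = base64.b64decode(prefix)

# Show the bytes in hex to make the pattern visible.
print(raw.hex(" "))

# Check whether every second byte is a zero padding byte.
odd_bytes_are_zero = all(b == 0 for b in raw[1::2])
print(odd_bytes_are_zero)
```

The non-zero bytes are not readable HTML yet, which is consistent with a further transformation being applied on top of the base64 encoding.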

Using the first option (re-implementing the decryption function in Python), here is how you could do that:

import requests
from bs4 import BeautifulSoup
import base64
import json

url = "https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates"

session = requests.Session()

r = session.get(url)
soup = BeautifulSoup(r.text, "html.parser")

key = session.cookies.get_dict()["v1"]
encryptedDivs = [ i["cipherxx"] for i in soup.find_all("div") if i.get("cipherxx")]

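unencrypted = []
for div in encryptedDivs:
    # base64-decode the attribute value into raw bytes
    encryptedData = base64.b64decode(div)
    # each character occupies two bytes; keep only the first byte of each pair
    cipher = "".join([
        chr(encryptedData[i])
        for i in range(0, len(encryptedData), 2)
    ])
    # XOR each character with the key, repeating the key as needed
    data = ""
    for i in range(0, len(cipher)):
        c_num = ord(cipher[i])
        k_num = ord(key[i % len(key)])
        c2 = c_num ^ k_num
        data += chr(c2)

    unencrypted.append(data)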

# unencrypted[0] is the header div with some info about stock price etc...
# unencrypted[1] is the first table
# lets parse the second table unencrypted[2]

soup = BeautifulSoup(unencrypted[2], "html.parser")

# only the body rows are needed; the header row is not used below
tbody = soup.find("tbody").find_all("tr", recursive=False)

table2 = [
    {
        "Date": t[0].text.strip(),
        "Time": t[1].text.strip(),
        "Period": t[2].text.strip(),
        "Conference Call": t[3].text.strip(),
        "Price Effect" : t[4].find("span").text if t[4].find("span") else t[4].text.strip(),
        "Implied Straddle": t[5].text.strip(),
        "Closing Price": t[6].text.strip(),
        "Opening Gap": t[7].text.strip(),
        "Drift Since": t[8].text.strip(),
        "Range Since": t[9].text.strip(),
        "Price Change 1 Week Before":t[10].text.strip(),
        "Price Change 1 Week After": t[11].text.strip()
    }
    for t in (t.find_all('td', recursive=False) for t in tbody)
    if len(t) >= 12  # each row is indexed up to t[11]
]

print(json.dumps(table2, indent=4, sort_keys=True))

Note that the encryption key is stored in the cookie named v1, which is why you need requests.Session().
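The decryption step can also be packaged as a small function and tried without touching the network. The following is a minimal sketch, not the site's own code: decrypt_cipherxx, CipherCollector, sample_html and sample_key are hypothetical names, the inline HTML stands in for the real response, and sample_key stands in for the value of the v1 cookie. It uses the standard library's html.parser instead of BeautifulSoup so that it has no third-party dependencies.

```python
import base64
from html.parser import HTMLParser


def decrypt_cipherxx(payload_b64: str, key: str) -> str:
    """XOR-decrypt one base64 payload with a repeating key."""
    raw = base64.b64decode(payload_b64)
    # Each character is stored as two bytes; only the first byte of each pair is used.
    low_bytes = raw[0::2]
    return "".join(
        chr(b ^ ord(key[i % len(key)])) for i, b in enumerate(low_bytes)
    )


class CipherCollector(HTMLParser):
    """Collects the value of every cipherxx attribute found on a <div>."""

    def __init__(self):
        super().__init__()
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            for name, value in attrs:
                if name == "cipherxx" and value:
                    self.payloads.append(value)


# Hypothetical inline page and key, standing in for the real response and cookie.
sample_html = '<div class="x"><div cipherxx="cQByAHkAfgB8AA=="></div></div>'
sample_key = "97523022"

collector = CipherCollector()
collector.feed(sample_html)
decrypted = [decrypt_cipherxx(p, sample_key) for p in collector.payloads]
print(decrypted)
```

With a real response, the key would come from the session's v1 cookie and the HTML from the response body, as in the code above.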

This is XOR encryption. Each value of the data is XORed with a key (in this case the key is stored in a cookie), and the key is repeated when it is shorter than the data. To decrypt, you just XOR the ciphertext with the same key to get the original data back.

The clearest way to explain it is with an example:

  • The data is the string "HELLO"
  • The key is the string "97523022"
"H"       "E"        "L"        "L"        "O"
 72        69         76         76         79
 01001000  01000101   01001100   01001100   01001111


"9"       "7"        "5"        "2"        "3"
 57        55         53         50         51
 00111001  00110111   00110101   00110010   00110011

     01001000  01000101   01001100   01001100   01001111
XOR  00111001  00110111   00110101   00110010   00110011
==>  01110001  01110010   01111001   01111110   01111100         
       113        114       121        126        124
HEX   \x71       \x72      \x79       \x7E       \x7C


pad each byte with a trailing zero byte, so that each character occupies two bytes:
HEX    \x71\x00 \x72\x00 \x79\x00 \x7E\x00 \x7C\x00

encode \x71\x00\x72\x00\x79\x00\x7E\x00\x7C\x00 to base64

which gives : 'cQByAHkAfgB8AA=='
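The encoding steps above can be reproduced in a few lines of Python. This is only a sketch of the scheme as described in this answer, and xor_encrypt_to_base64 is a hypothetical helper name; it assumes the data and the key contain only single-byte (ASCII) characters.

```python
import base64


def xor_encrypt_to_base64(plain: str, key: str) -> str:
    """XOR each character with the repeating key, pad with zero bytes, base64-encode."""
    # XOR every character of the data with the matching character of the key.
    xored = bytes(ord(c) ^ ord(key[i % len(key)]) for i, c in enumerate(plain))
    # Follow each XOR-ed byte with a zero byte, so each character takes two bytes.
    padded = b"".join(bytes([b, 0]) for b in xored)
    return base64.b64encode(padded).decode("ascii")


print(xor_encrypt_to_base64("HELLO", "97523022"))  # cQByAHkAfgB8AA==
```

Because XOR is its own inverse, decrypting is the same operation run in the other direction: base64-decode, drop the zero bytes, and XOR with the key again.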

Try this code to decrypt it (the same logic as the decryption loop above):

import base64

key = "97523022"
payload = "cQByAHkAfgB8AA=="

data = base64.b64decode(payload)

cipher = "".join([
    chr(data[i]) 
    for i in range(0,len(data),2)
])
data = ""
for i in range(0, len(cipher)):
    c_num = ord(cipher[i])
    k_num = ord(key[i % len(key)])
    c2 = c_num ^ k_num
    data += chr(c2)

print(data)

Output:

HELLO

You can also check this link and this wiki if you are interested.
