How to scrape a website which redirects for some time


Problem Description

I am trying to scrape a website that shows a DDoS-prevention page with a delay of about 5 seconds before the real content appears. The website is

Koinex

I am using Python 3 and BeautifulSoup. I think I need to introduce a time delay after sending the request and before reading the content.

Here is what I have done so far:

import requests
from bs4 import BeautifulSoup
url = 'https://koinex.in/'
response = requests.get(url)
html = response.content 

Solution

The site uses JavaScript to generate a value which is sent to the page https://koinex.in/cdn-cgi/l/chk_jschl to get the cookie cf_clearance, which the page then checks in order to skip the DDoS page.
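
For reference, you can check from a plain requests response whether you are still stuck on the protection page. A minimal sketch, assuming the 2017-era Cloudflare challenge, which typically answers with HTTP 503 and embeds a form pointing at chk_jschl (both details are my assumptions, not taken from this answer):

import requests

url = 'https://koinex.in/'
response = requests.get(url)

# status 503 plus the chk_jschl form means we only got the challenge page
if response.status_code == 503 and 'chk_jschl' in response.text:
    print('still on the DDoS-protection page, real content not loaded')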

The JavaScript can generate the value using different parameters and different methods on every request, so it is easier to use Selenium to get the data:

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://koinex.in/')

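# give the JavaScript challenge a few seconds to finish and the real page to load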
time.sleep(8)

tables = driver.find_elements_by_tag_name('table')

for item in tables:
    print(item.text)
    #print(item.get_attribute("value"))

Result

VOLUME PRICE/ETH
5.2310 64,300.00
0.0930 64,100.00
10.7670 64,025.01
0.0840 64,000.00
0.3300 63,800.00
0.2800 63,701.00
0.4880 63,700.00
0.7060 63,511.00
0.5020 63,501.00
0.1010 63,500.01
1.4850 63,500.00
1.0000 63,254.00
0.0300 63,253.00
VOLUME PRICE/ETH
1.0000 64,379.00
0.0940 64,380.00
0.9710 64,398.00
0.0350 64,399.00
0.7170 64,400.00
0.3000 64,479.00
5.1650 64,480.35
0.0020 64,495.00
0.2000 64,496.00
9.5630 64,500.00
0.4000 64,501.01
0.0400 64,550.00
0.5220 64,600.00
DATE VOLUME PRICE/ETH
31/12/2017, 12:19:29 0.2770 64,300.00
31/12/2017, 12:19:11 0.5000 64,300.00
31/12/2017, 12:18:28 0.3440 64,025.01
31/12/2017, 12:18:28 0.0750 64,026.00
31/12/2017, 12:17:50 0.0010 64,300.00
31/12/2017, 12:17:47 0.0150 64,300.00
31/12/2017, 12:15:45 0.6720 64,385.00
31/12/2017, 12:15:45 0.2000 64,300.00
31/12/2017, 12:15:45 0.0620 64,300.00
31/12/2017, 12:15:45 0.0650 64,199.97
31/12/2017, 12:15:45 0.0010 64,190.00
31/12/2017, 12:15:45 0.0030 64,190.00
31/12/2017, 12:15:25 0.0010 64,190.00
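
A side note: instead of the fixed time.sleep(8) above, an explicit wait is more robust. A sketch of my own (not from the original answer), assuming the tables appear once the challenge clears:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://koinex.in/')

# wait up to 15 seconds for the first <table> instead of sleeping a fixed time
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)

for item in driver.find_elements_by_tag_name('table'):
    print(item.text)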

You can also get the HTML from Selenium and use it with BeautifulSoup:

soup = BeautifulSoup(driver.page_source, 'html.parser')

but Selenium can get data using XPath, CSS selectors and other methods, so most of the time there is no need to use BeautifulSoup.

See documentation: 4. Locating Elements
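
For example, the tables above can also be located with XPath or CSS selectors (the selectors here are my guesses at the page structure, shown only to illustrate the API):

# different locator strategies for the same data (Selenium 3 style, as above)
rows = driver.find_elements_by_xpath('//table//tr')
cells = driver.find_elements_by_css_selector('table td')

for row in rows:
    print(row.text)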


EDIT: the code below uses the cookies from Selenium to load the page with requests, and it has no problem with the DDoS page.

The problem is that the page uses JavaScript to display the tables, so you can't get them using requests + BeautifulSoup. But maybe you will find the URLs that the JavaScript uses to fetch the table data, and then requests can be useful (see the sketch after the code below).

from selenium import webdriver
import time

# --- Selenium ---

url = 'https://koinex.in/'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(8)

#tables = driver.find_elements_by_tag_name('table')
#for item in tables:
#    print(item.text)

# --- convert cookies/headers from Selenium to Requests ---

cookies = driver.get_cookies()

for item in cookies:
    print('name:', item['name'])
    print('value:', item['value'])
    print('path:', item['path'])
    print('domain:', item['domain'])
    print('expiry:', item.get('expiry'))  # .get() because session cookies have no 'expiry'
    print('secure:', item['secure'])
    print('httpOnly:', item['httpOnly'])
    print('----')

# convert list of dictionaries into dictionary
cookies = {c['name']: c['value'] for c in cookies}

# it has to be the full `User-Agent` string used by the browser/Selenium (a short 'Mozilla/5.0' won't work)
headers = {'User-Agent': driver.execute_script('return navigator.userAgent')}

# --- requests + BeautifulSoup ---

import requests
from bs4 import BeautifulSoup

s = requests.Session()
s.headers.update(headers)
s.cookies.update(cookies)

r = s.get(url)

print(r.text)

soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.find_all('table')

print('tables:', len(tables))

for item in tables:
    print(item.get_text())
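
If you do find such a URL in the browser's developer tools (network tab), the same session s can call it directly. A sketch with a purely hypothetical endpoint and response format, since the real one is not named in this answer:

# 'https://koinex.in/api/orders' is a made-up example URL - replace it with
# the address actually seen in the network tab
r = s.get('https://koinex.in/api/orders')

if r.ok:
    data = r.json()  # assuming the endpoint returns JSON
    print(data)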
