Extracting data tables from HTML source after scraping using Selenium & Python


Question

I am trying to scrape data from this link. I have gone through the existing questions on the topic and have managed to scrape some of the data, but there are a few issues with the results. Below is the code I used.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

options = Options() 
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get('http://www.scstrade.com/MarketStatistics/MS_HistoricalIndices.aspx') 

inputElement_index = driver.find_element_by_id("txtSearch")
inputElement_index.send_keys('KSE ALL')


inputElement_date = driver.find_element_by_id("date1")
inputElement_date.send_keys('03/12/2019')

inputElement_date_end = driver.find_element_by_id("date2")
inputElement_date_end.send_keys('03/12/2020')

inputElement_viewprice = driver.find_element_by_id("btn1")
inputElement_viewprice.send_keys(Keys.ENTER)

tabel = driver.find_elements_by_css_selector('table > tbody')[0]

The aim is to extract data from the link for dates between 12th Mar 2020 and 03rd Mar 2020, for the index KSE ALL. The code above runs, but on the first run the table object produced by the last line is blank; if I re-run that last line, it returns the table from the first page in string format. Why don't I get the table the first time the code runs? And how can I get a pandas DataFrame from the table object, which comes back as a string?

I tried the following code to load the first page of data into a pandas DataFrame, but the table object turns out to be 'NoneType'.

htmlSource = driver.page_source
soup = BeautifulSoup(htmlSource, 'html.parser')
table = soup.find('table', class_='tbody')

Second, I want to extract all of the data, not just the data on the first page. The number of pages is dynamic and changes with the date range. To move to the next page, I tried the following piece of code:

driver.find_element_by_id("next_pager").click()

I get the following error.

selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <td id="next_pager" class="ui-pg-button" title="Next Page">...</td> is not clickable at point (790, 95). Other element would receive the click: <div class="loading row" id="load_list" style="display: block;">...</div>

I looked up how to resolve this and wrote the code below to add some waiting time, but I got the same error as above.

wait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[title="Next Page"]'))).click()

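How can I move to the subsequent pages, extract the data from all of them (the number of pages is dynamic, depending on the date range), and append it to the data extracted from the previous pages?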

Recommended answer

I would prefer the API approach in this case; it is faster and easier for getting the data, and you don't have to page through the table. Below is the code that calls the API and prints the response (I only changed the date range so that a single request returns what would be several pages of data).

import requests

url = "http://www.scstrade.com/MarketStatistics/MS_HistoricalIndices.aspx/chart"

payload = "{\"par\": \"KSE All\", \"date1\": \"01/03/2020\",\"date2\": \"03/12/2020\"}"
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data = payload)

print(response.text.encode('utf8'))
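
The request body above is JSON written out by hand with escaped quotes. As a minimal sketch, the same body could be built with `json.dumps`. Here `build_payload` is a hypothetical helper, and the `MM/DD/YYYY` date format is an assumption inferred from the `date1` and `date2` values in the request, not something the site documents.

```python
import json
from datetime import date


def build_payload(index_name, start, end):
    """Return the JSON request body as a string.

    Assumption: the endpoint expects dates formatted as MM/DD/YYYY,
    which is what the hand-written payload appears to use.
    """
    body = {
        "par": index_name,
        "date1": start.strftime("%m/%d/%Y"),
        "date2": end.strftime("%m/%d/%Y"),
    }
    return json.dumps(body)


payload = build_payload("KSE All", date(2020, 1, 3), date(2020, 3, 12))
print(payload)
```

The resulting string can be passed as `data=` in the `requests` call, in place of the hand-escaped payload.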

The only catch is that you have to convert the date format in the response.
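
One possible way to do that conversion is sketched below. The dates arrive as strings of the form `/Date(<milliseconds since epoch>)/` once the JSON is decoded. `parse_ms_date` is a hypothetical helper, and the UTC+5 offset is an assumption: it is chosen because it makes the first and last timestamps in the result land on midnight of the requested start and end dates.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the timestamps encode local midnight at UTC+5.
LOCAL_TZ = timezone(timedelta(hours=5))

# Matches the decoded form, e.g. "/Date(1577991600000)/".
_DATE_RE = re.compile(r"/Date\((-?\d+)\)/")


def parse_ms_date(value, tz=LOCAL_TZ):
    """Convert a '/Date(ms)/' string into a timezone-aware datetime."""
    match = _DATE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected date format: {value!r}")
    millis = int(match.group(1))
    return datetime.fromtimestamp(millis / 1000, tz=tz)


first = parse_ms_date("/Date(1577991600000)/")
print(first.date().isoformat())
```

Passing a different `tz` changes only how the instant is displayed, not the instant itself.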

Result:

b'{"d":[{"kse_index_id":13362,"kse_index_type_id":1,"kse_index_date":"\\/Date(1577991600000)\\/","kse_index_open":30046.67,"kse_index_high":30053.64,"kse_index_low":29665.65,"kse_index_close":29774.00,"kse_index_value":322398592,"kse_index_change":-98.97,"kse_index_changep":-0.33},{"kse_index_id":13366,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578250800000)\\/","kse_index_open":29547.06,"kse_index_high":29774.00,"kse_index_low":29101.65,"kse_index_close":29145.52,"kse_index_value":266525664,"kse_index_change":-628.48,"kse_index_changep":-2.11},{"kse_index_id":13370,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578337200000)\\/","kse_index_open":29209.91,"kse_index_high":29393.74,"kse_index_low":29072.69,"kse_index_close":29375.75,"kse_index_value":206397936,"kse_index_change":230.23,"kse_index_changep":0.79},{"kse_index_id":13374,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578423600000)\\/","kse_index_open":29157.77,"kse_index_high":29375.75,"kse_index_low":28882.75,"kse_index_close":29010.85,"kse_index_value":279807072,"kse_index_change":-364.90,"kse_index_changep":-1.24},{"kse_index_id":13378,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578510000000)\\/","kse_index_open":29319.08,"kse_index_high":29667.92,"kse_index_low":29010.85,"kse_index_close":29654.66,"kse_index_value":361992128,"kse_index_change":643.81,"kse_index_changep":2.22},{"kse_index_id":13382,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578596400000)\\/","kse_index_open":29732.02,"kse_index_high":30070.99,"kse_index_low":29654.66,"kse_index_close":30058.45,"kse_index_value":400051936,"kse_index_change":403.79,"kse_index_changep":1.36},{"kse_index_id":13386,"kse_index_type_id":1,"kse_index_date":"\\/Date(1578855600000)\\/","kse_index_open":30109.26,"kse_index_high":30194.74,"kse_index_low":29901.75,"kse_index_close":30020.98,"kse_index_value":365810592,"kse_index_change":-37.47,"kse_index_changep":-0.13},{"kse_index_id":13390,"kse_index_type_id":1,"kse_index_date":"\\/Date(15789
42000000)\\/","kse_index_open":30059.23,"kse_index_high":30150.96,"kse_index_low":29932.22,"kse_index_close":29973.44,"kse_index_value":249556960,"kse_index_change":-47.54,"kse_index_changep":-0.16},{"kse_index_id":13394,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579028400000)\\/","kse_index_open":29986.93,"kse_index_high":29999.17,"kse_index_low":29799.04,"kse_index_close":29892.79,"kse_index_value":171127728,"kse_index_change":-80.65,"kse_index_changep":-0.27},{"kse_index_id":13398,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579114800000)\\/","kse_index_open":29913.22,"kse_index_high":30007.53,"kse_index_low":29779.46,"kse_index_close":29914.47,"kse_index_value":229585632,"kse_index_change":21.68,"kse_index_changep":0.07},{"kse_index_id":13402,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579201200000)\\/","kse_index_open":29929.81,"kse_index_high":30037.83,"kse_index_low":29914.46,"kse_index_close":29998.45,"kse_index_value":211220464,"kse_index_change":83.98,"kse_index_changep":0.28},{"kse_index_id":13406,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579460400000)\\/","kse_index_open":30043.65,"kse_index_high":30089.73,"kse_index_low":29734.95,"kse_index_close":29808.60,"kse_index_value":173774336,"kse_index_change":-189.85,"kse_index_changep":-0.63},{"kse_index_id":13410,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579546800000)\\/","kse_index_open":29856.28,"kse_index_high":29928.72,"kse_index_low":29621.78,"kse_index_close":29735.95,"kse_index_value":177421264,"kse_index_change":-72.65,"kse_index_changep":-0.24},{"kse_index_id":13414,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579633200000)\\/","kse_index_open":29746.05,"kse_index_high":29754.25,"kse_index_low":29308.76,"kse_index_close":29561.63,"kse_index_value":177486256,"kse_index_change":-174.32,"kse_index_changep":-0.59},{"kse_index_id":13418,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579719600000)\\/","kse_index_open":29621.60,"kse_index_high":29759.68,"kse_index_low":29
409.24,"kse_index_close":29456.52,"kse_index_value":230561152,"kse_index_change":-105.11,"kse_index_changep":-0.36},{"kse_index_id":13422,"kse_index_type_id":1,"kse_index_date":"\\/Date(1579806000000)\\/","kse_index_open":29440.00,"kse_index_high":29585.39,"kse_index_low":29318.90,"kse_index_close":29529.89,"kse_index_value":172677024,"kse_index_change":73.37,"kse_index_changep":0.25},{"kse_index_id":13426,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580065200000)\\/","kse_index_open":29533.27,"kse_index_high":29594.55,"kse_index_low":29431.95,"kse_index_close":29462.60,"kse_index_value":198224992,"kse_index_change":-67.29,"kse_index_changep":-0.23},{"kse_index_id":13430,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580151600000)\\/","kse_index_open":29457.47,"kse_index_high":29462.59,"kse_index_low":29230.53,"kse_index_close":29345.90,"kse_index_value":188781760,"kse_index_change":-116.70,"kse_index_changep":-0.40},{"kse_index_id":13434,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580238000000)\\/","kse_index_open":29354.64,"kse_index_high":29446.90,"kse_index_low":29083.61,"kse_index_close":29135.35,"kse_index_value":197011200,"kse_index_change":-210.55,"kse_index_changep":-0.72},{"kse_index_id":13438,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580324400000)\\/","kse_index_open":29132.60,"kse_index_high":29181.59,"kse_index_low":28969.60,"kse_index_close":29123.53,"kse_index_value":162120016,"kse_index_change":-11.82,"kse_index_changep":-0.04},{"kse_index_id":13442,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580410800000)\\/","kse_index_open":29166.18,"kse_index_high":29257.79,"kse_index_low":28945.19,"kse_index_close":29067.54,"kse_index_value":193415040,"kse_index_change":-55.99,"kse_index_changep":-0.19},{"kse_index_id":13446,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580670000000)\\/","kse_index_open":28941.02,"kse_index_high":29067.54,"kse_index_low":28246.97,"kse_index_close":28315.61,"kse_index_value":202691712,"kse_index_change":
-751.93,"kse_index_changep":-2.59},{"kse_index_id":13450,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580756400000)\\/","kse_index_open":28356.76,"kse_index_high":28506.86,"kse_index_low":28245.23,"kse_index_close":28493.84,"kse_index_value":145986304,"kse_index_change":178.23,"kse_index_changep":0.63},{"kse_index_id":13454,"kse_index_type_id":1,"kse_index_date":"\\/Date(1580929200000)\\/","kse_index_open":28577.12,"kse_index_high":28633.74,"kse_index_low":28375.60,"kse_index_close":28398.38,"kse_index_value":127719744,"kse_index_change":-95.46,"kse_index_changep":-0.34},{"kse_index_id":13458,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581015600000)\\/","kse_index_open":28458.74,"kse_index_high":28458.75,"kse_index_low":27983.62,"kse_index_close":28042.82,"kse_index_value":193151648,"kse_index_change":-355.56,"kse_index_changep":-1.25},{"kse_index_id":13462,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581274800000)\\/","kse_index_open":28043.58,"kse_index_high":28053.71,"kse_index_low":27470.38,"kse_index_close":27520.35,"kse_index_value":180630816,"kse_index_change":-522.47,"kse_index_changep":-1.86},{"kse_index_id":13466,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581361200000)\\/","kse_index_open":27601.00,"kse_index_high":28017.17,"kse_index_low":27492.28,"kse_index_close":27865.16,"kse_index_value":161458304,"kse_index_change":344.81,"kse_index_changep":1.25},{"kse_index_id":13470,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581447600000)\\/","kse_index_open":27959.20,"kse_index_high":28384.45,"kse_index_low":27865.16,"kse_index_close":28309.35,"kse_index_value":179861264,"kse_index_change":444.19,"kse_index_changep":1.59},{"kse_index_id":13474,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581534000000)\\/","kse_index_open":28380.58,"kse_index_high":28468.96,"kse_index_low":28191.97,"kse_index_close":28256.09,"kse_index_value":197307008,"kse_index_change":-53.26,"kse_index_changep":-0.19},{"kse_index_id":13478,"kse_index_type_id":1,"kse
_index_date":"\\/Date(1581620400000)\\/","kse_index_open":28327.55,"kse_index_high":28330.57,"kse_index_low":27917.81,"kse_index_close":28015.75,"kse_index_value":117521904,"kse_index_change":-240.34,"kse_index_changep":-0.85},{"kse_index_id":13482,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581879600000)\\/","kse_index_open":28023.74,"kse_index_high":28130.89,"kse_index_low":27900.27,"kse_index_close":28002.69,"kse_index_value":99813272,"kse_index_change":-13.06,"kse_index_changep":-0.05},{"kse_index_id":13486,"kse_index_type_id":1,"kse_index_date":"\\/Date(1581966000000)\\/","kse_index_open":28036.95,"kse_index_high":28141.44,"kse_index_low":27758.54,"kse_index_close":27807.10,"kse_index_value":91269288,"kse_index_change":-195.59,"kse_index_changep":-0.70},{"kse_index_id":13490,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582052400000)\\/","kse_index_open":27843.99,"kse_index_high":28108.02,"kse_index_low":27807.11,"kse_index_close":28063.85,"kse_index_value":142765888,"kse_index_change":256.75,"kse_index_changep":0.92},{"kse_index_id":13494,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582138800000)\\/","kse_index_open":28122.04,"kse_index_high":28132.98,"kse_index_low":27989.14,"kse_index_close":28018.02,"kse_index_value":111998784,"kse_index_change":-45.83,"kse_index_changep":-0.16},{"kse_index_id":13498,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582225200000)\\/","kse_index_open":28028.61,"kse_index_high":28039.38,"kse_index_low":27856.26,"kse_index_close":27895.15,"kse_index_value":85454400,"kse_index_change":-122.87,"kse_index_changep":-0.44},{"kse_index_id":13502,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582484400000)\\/","kse_index_open":27880.35,"kse_index_high":27895.15,"kse_index_low":27200.92,"kse_index_close":27248.30,"kse_index_value":144128160,"kse_index_change":-646.85,"kse_index_changep":-2.32},{"kse_index_id":13506,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582570800000)\\/","kse_index_open":27206.95,"kse_index_high
":27321.33,"kse_index_low":26851.06,"kse_index_close":27018.98,"kse_index_value":124276016,"kse_index_change":-229.32,"kse_index_changep":-0.84},{"kse_index_id":13510,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582657200000)\\/","kse_index_open":27058.85,"kse_index_high":27070.75,"kse_index_low":26560.92,"kse_index_close":26687.95,"kse_index_value":147798160,"kse_index_change":-331.03,"kse_index_changep":-1.23},{"kse_index_id":13514,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582743600000)\\/","kse_index_open":26355.50,"kse_index_high":26687.95,"kse_index_low":25780.38,"kse_index_close":26396.96,"kse_index_value":248988672,"kse_index_change":-290.99,"kse_index_changep":-1.09},{"kse_index_id":13518,"kse_index_type_id":1,"kse_index_date":"\\/Date(1582830000000)\\/","kse_index_open":26302.05,"kse_index_high":26519.47,"kse_index_low":26181.00,"kse_index_close":26289.38,"kse_index_value":201662240,"kse_index_change":-107.58,"kse_index_changep":-0.41},{"kse_index_id":13522,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583089200000)\\/","kse_index_open":26342.71,"kse_index_high":27096.59,"kse_index_low":26289.38,"kse_index_close":27059.34,"kse_index_value":215058320,"kse_index_change":769.96,"kse_index_changep":2.93},{"kse_index_id":13526,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583175600000)\\/","kse_index_open":27200.11,"kse_index_high":27385.30,"kse_index_low":26854.16,"kse_index_close":27054.89,"kse_index_value":225222304,"kse_index_change":-4.45,"kse_index_changep":-0.02},{"kse_index_id":13530,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583262000000)\\/","kse_index_open":27070.16,"kse_index_high":27069.35,"kse_index_low":26797.32,"kse_index_close":26919.79,"kse_index_value":186877760,"kse_index_change":-135.10,"kse_index_changep":-0.50},{"kse_index_id":13534,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583348400000)\\/","kse_index_open":26961.15,"kse_index_high":27369.98,"kse_index_low":26919.79,"kse_index_close":27228.79,"kse_index_value
":340043072,"kse_index_change":309.00,"kse_index_changep":1.15},{"kse_index_id":13538,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583434800000)\\/","kse_index_open":27126.48,"kse_index_high":27228.79,"kse_index_low":26517.64,"kse_index_close":26557.85,"kse_index_value":244063824,"kse_index_change":-670.94,"kse_index_changep":-2.46},{"kse_index_id":13542,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583694000000)\\/","kse_index_open":25878.94,"kse_index_high":26557.85,"kse_index_low":25304.60,"kse_index_close":25875.06,"kse_index_value":307753952,"kse_index_change":-682.79,"kse_index_changep":-2.57},{"kse_index_id":13546,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583780400000)\\/","kse_index_open":25758.62,"kse_index_high":26210.06,"kse_index_low":25719.55,"kse_index_close":26184.13,"kse_index_value":274065504,"kse_index_change":309.07,"kse_index_changep":1.19},{"kse_index_id":13550,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583866800000)\\/","kse_index_open":26331.02,"kse_index_high":26562.31,"kse_index_low":26061.81,"kse_index_close":26127.67,"kse_index_value":217595296,"kse_index_change":-56.46,"kse_index_changep":-0.22},{"kse_index_id":13554,"kse_index_type_id":1,"kse_index_date":"\\/Date(1583953200000)\\/","kse_index_open":26002.00,"kse_index_high":26127.67,"kse_index_low":25245.98,"kse_index_close":25310.97,"kse_index_value":230028032,"kse_index_change":-816.70,"kse_index_changep":-3.13}]}'
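
The response is a JSON object whose `d` key holds one record per trading day. As a rough sketch, the records can be pulled out with the standard `json` module. `records_from_response` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the first record from the result above, keeping only four of its keys.

```python
import json

# Trimmed sample: first record of the result, with most keys omitted.
SAMPLE = (
    r'{"d":[{"kse_index_id":13362,'
    r'"kse_index_date":"\/Date(1577991600000)\/",'
    r'"kse_index_open":30046.67,"kse_index_close":29774.00}]}'
)


def records_from_response(text):
    """Decode the response body and return the list of row dicts under 'd'."""
    body = json.loads(text)
    return body["d"]


records = records_from_response(SAMPLE)
for row in records:
    print(row["kse_index_id"], row["kse_index_date"], row["kse_index_close"])
```

With the real response, `records_from_response(response.text)` would return every row in one list, with no paging involved. If pandas is available, `pd.DataFrame(records)` turns that list of dicts into a DataFrame with one column per key.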
