Selenium: Web-Scraping Historical Data from Coincodex and Transforming It into a Pandas DataFrame
Question
I'm struggling to scrape historical data from https://coincodex.com/crypto/bitcoin/historical-data/ with Selenium. Somehow I fail at the following steps:
- Getting the data from the subsequent pages (not only September, i.e. page 1)
- Replacing the '$ ' in each value with '$'
- Converting values with a B (billion) suffix into full numbers (1B becomes 1000000000)
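The last two bullets can be sketched as a small helper. The suffix table below is an assumption based on the values Coincodex displays (B = billion, T = trillion; M is included in case smaller coins use it):

```python
def parse_money(text):
    """Turn a Coincodex display value like '$ 82.73B' into a plain float."""
    # Strip the dollar sign, thousands separators, and any stray
    # narrow no-break spaces (\u202f) the site uses for formatting.
    cleaned = text.replace('$', '').replace(',', '').replace('\u202f', '').strip()
    # Assumed suffix multipliers: M = million, B = billion, T = trillion.
    multipliers = {'M': 1e6, 'B': 1e9, 'T': 1e12}
    if cleaned and cleaned[-1] in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1]]
    return float(cleaned)

print(parse_money('$ 82.73B'))   # volume expands to a full number
print(parse_money('$ 62,225'))   # plain prices just lose '$' and ','
```

The same function works for every column, so it can later be mapped over a whole DataFrame.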
The predefined task is: web-scrape all data from the beginning of the year until the end of September with Selenium and BeautifulSoup and transform it into a pandas df. My code so far is:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)

# Web page fetched from the driver is parsed using Beautiful Soup
HTMLPage = BeautifulSoup(driver.page_source, 'html.parser')
Table = HTMLPage.find('table', class_='styled-table full-size-table')
Rows = Table.find_all('tr', class_='ng-star-inserted')

# Empty list is created to store the data
extracted_data = []

# Loop to go through each row of the table
for i in range(len(Rows)):
    try:
        # Empty dictionary to store the data present in each row
        RowDict = {}
        # Extract all the columns of the row
        Values = Rows[i].find_all('td')
        # Values (Open, High, Close etc.) are extracted and stored in the dictionary
        if len(Values) == 7:
            RowDict["Date"] = Values[0].text.replace(',', '')
            RowDict["Open"] = Values[1].text.replace(',', '')
            RowDict["High"] = Values[2].text.replace(',', '')
            RowDict["Low"] = Values[3].text.replace(',', '')
            RowDict["Close"] = Values[4].text.replace(',', '')
            RowDict["Volume"] = Values[5].text.replace(',', '')
            RowDict["Market Cap"] = Values[6].text.replace(',', '')
            extracted_data.append(RowDict)
    except Exception:
        print("Row Number: " + str(i))

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)
Sorry, I'm new to Python and web scraping and I hope someone can help me. It would be very much appreciated.
Answer
To extract the Bitcoin (BTC) historical data from all seven columns of the Coincodex site, you need to induce WebDriverWait for visibility_of_all_elements_located(). Then, using a list comprehension, you can build a list for each column, create a DataFrame from them, and finally print the values excluding the index, using the following locator strategies:
Code block:
driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
df = pd.DataFrame(data=list(zip(dates, opens, highs, lows, closes, volumes, marketcaps)), columns=headers)
print(df)
driver.quit()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
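The code above only reads the page that is currently shown, so it still doesn't answer the question's first bullet (subsequent pages). One way to structure the pagination loop is to separate the loop logic from the browser calls; the Selenium wiring in the comments is a sketch, and any pager selector you use (e.g. a "next" button locator) must be taken from Coincodex's actual markup, which is not confirmed here:

```python
def scrape_all_pages(get_rows, click_next):
    """Collect rows from every page of a paginated table.

    get_rows   -- callable returning the current page's rows as a list
    click_next -- callable that advances to the next page and returns
                  True, or returns False when already on the last page
    """
    all_rows = []
    while True:
        all_rows.extend(get_rows())
        if not click_next():
            break
    return all_rows

# Selenium wiring (sketch only; the pager locator is hypothetical --
# inspect the real 'next' control in the browser's dev tools first):
#
# def get_rows():
#     return [r.text for r in WebDriverWait(driver, 20).until(
#         EC.visibility_of_all_elements_located(
#             (By.CSS_SELECTOR, "table.full-size-table tr.ng-star-inserted")))]
#
# def click_next():
#     buttons = driver.find_elements(By.CSS_SELECTOR, "<next-page locator>")
#     if not buttons:
#         return False
#     buttons[0].click()
#     return True
#
# rows = scrape_all_pages(get_rows, click_next)
```

Keeping the loop separate also makes it easy to test without a browser, by passing stub callables.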
Console output:
Date Open High Low Close Volume Market Cap
0 Oct 30, 2021 $ 62,225 $ 62,225 $ 60,860 $ 61,661 $ 82.73B $ 1.16T
1 Oct 31, 2021 $ 61,856 $ 62,379 $ 60,135 $ 61,340 $ 74.91B $ 1.15T
2 Nov 01, 2021 $ 61,290 $ 62,368 $ 59,675 $ 61,065 $ 76.19B $ 1.16T
3 Nov 02, 2021 $ 60,939 $ 64,071 $ 60,682 $ 63,176 $ 74.05B $ 1.18T
4 Nov 03, 2021 $ 63,167 $ 63,446 $ 61,653 $ 62,941 $ 78.08B $ 1.18T
5 Nov 04, 2021 $ 62,907 $ 63,048 $ 60,740 $ 61,368 $ 91.06B $ 1.17T
6 Nov 05, 2021 $ 61,419 $ 62,480 $ 60,770 $ 61,026 $ 78.06B $ 1.16T
7 Nov 06, 2021 $ 60,959 $ 61,525 $ 60,083 $ 61,416 $ 67.75B $ 1.15T
8 Nov 07, 2021 $ 61,454 $ 63,180 $ 61,333 $ 63,180 $ 51.66B $ 1.17T
9 Nov 08, 2021 $ 63,278 $ 67,670 $ 63,278 $ 67,500 $ 74.25B $ 1.24T
10 Nov 09, 2021 $ 67,511 $ 68,476 $ 66,359 $ 66,913 $ 87.83B $ 1.27T
11 Nov 10, 2021 $ 66,929 $ 68,770 $ 63,348 $ 64,871 $ 82.52B $ 1.26T
12 Nov 11, 2021 $ 64,934 $ 65,580 $ 64,199 $ 64,800 $ 100.84B $ 1.22T
13 Nov 12, 2021 $ 64,774 $ 65,380 $ 62,434 $ 64,315 $ 71.88B $ 1.21T
14 Nov 13, 2021 $ 64,174 $ 64,850 $ 63,413 $ 64,471 $ 65.34B $ 1.21T
15 Nov 14, 2021 $ 64,385 $ 65,255 $ 63,623 $ 65,255 $ 59.25B $ 1.22T
16 Nov 15, 2021 $ 65,500 $ 66,263 $ 63,540 $ 63,716 $ 92.91B $ 1.23T
17 Nov 16, 2021 $ 63,610 $ 63,610 $ 58,904 $ 60,190 $ 103.18B $ 1.15T
18 Nov 17, 2021 $ 60,111 $ 60,734 $ 58,758 $ 60,339 $ 96.57B $ 1.13T
19 Nov 18, 2021 $ 60,348 $ 60,863 $ 56,542 $ 56,749 $ 86.65B $ 1.11T
20 Nov 19, 2021 $ 56,960 $ 58,289 $ 55,653 $ 58,047 $ 98.57B $ 1.08T
21 Nov 20, 2021 $ 58,069 $ 59,815 $ 57,486 $ 59,815 $ 61.67B $ 1.11T
22 Nov 21, 2021 $ 59,670 $ 59,845 $ 58,545 $ 58,681 $ 54.40B $ 1.12T
23 Nov 22, 2021 $ 58,712 $ 59,061 $ 55,689 $ 56,370 $ 64.89B $ 1.08T
24 Nov 23, 2021 $ 56,258 $ 57,832 $ 55,778 $ 57,673 $ 80.27B $ 1.07T
25 Nov 24, 2021 $ 57,531 $ 57,694 $ 55,970 $ 57,103 $ 92.08B $ 1.07T
26 Nov 25, 2021 $ 57,193 $ 59,333 $ 57,011 $ 58,907 $ 85.14B $ 1.10T
27 Nov 26, 2021 $ 58,914 $ 59,120 $ 53,660 $ 53,664 $ 90.87B $ 1.05T
28 Nov 27, 2021 $ 53,559 $ 55,204 $ 53,559 $ 54,487 $ 85.68B $ 1.03T
29 Nov 28, 2021 $ 54,819 $ 57,315 $ 53,630 $ 57,159 $ 72.40B $ 1.03T
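The task also asks to restrict the data to the beginning of the year through the end of September. Once the Date column is parsed with pandas, the DataFrame can be filtered by date; a minimal sketch, assuming the 'Oct 30, 2021' format shown in the output above (the sample rows here are illustrative):

```python
import pandas as pd

# Illustrative sample in the same format as the scraped table.
df = pd.DataFrame({'Date': ['Sep 29, 2021', 'Sep 30, 2021', 'Oct 01, 2021'],
                   'Close': ['$ 60,000', '$ 61,000', '$ 62,000']})

# Parse the 'Mon DD, YYYY' strings into real timestamps.
df['Date'] = pd.to_datetime(df['Date'], format='%b %d, %Y')

# Keep only rows from the start of the year through the end of September.
mask = (df['Date'] >= '2021-01-01') & (df['Date'] <= '2021-09-30')
df = df[mask]
print(df)
```

The same mask applies unchanged to the full scraped DataFrame once its Date column has been parsed.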