Selenium: Web-Scraping Historical Data from Coincodex and Transforming It into a Pandas DataFrame


Question

I am struggling to scrape some historical data across several pages with Selenium from https://coincodex.com/crypto/bitcoin/historical-data/. Somehow I fail at the following steps:

  1. Get the data from the subsequent pages as well (not only September, which is page 1)
  2. Strip the '$' sign from every value
  3. Convert values given in B (billion) into full numbers (1B into 1,000,000,000); see the sketch after this list for steps 2 and 3
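
A minimal sketch of how steps 2 and 3 could be handled once the raw strings are extracted (the helper name parse_money and the suffix map are illustrative assumptions, not something the site or the question prescribes):

def parse_money(text):
    """Turn strings such as '$ 82.73B' or '$ 62,225' into plain floats."""
    # Hypothetical suffix map; extend it if other magnitudes appear
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    cleaned = text.replace("$", "").replace(",", "").replace("\u202f", "").strip()
    if cleaned and cleaned[-1].upper() in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1].upper()]
    return float(cleaned)

print(parse_money("$ 82.73B"))  # 82730000000.0
print(parse_money("$ 62,225"))  # 62225.0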

The predefined task is: web-scrape all data from the beginning of the year until the end of September with Selenium and BeautifulSoup and transform it into a pandas DataFrame. My code so far is:

from selenium import webdriver
import time

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"

driver = webdriver.Chrome(executable_path = "/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)

webpage = driver.page_source

from bs4 import BeautifulSoup
import pandas as pd

# Web page fetched from the driver is parsed using Beautiful Soup
HTMLPage = BeautifulSoup(driver.page_source, 'html.parser')

Table = HTMLPage.find('table', class_='styled-table full-size-table')

Rows = Table.find_all('tr', class_='ng-star-inserted')
len(Rows)

# Empty list is created to store the data
extracted_data = []
# Loop to go through each row of table
for i in range(0, len(Rows)):
 try:
  # Empty dictionary to store data present in each row
  RowDict = {}
  # Extracted all the columns of a row and stored in a variable
  Values = Rows[i].find_all('td')
  
  # Values (Open, High, Close etc.) are extracted and stored in dictionary
  if len(Values) == 7:
   RowDict["Date"] = Values[0].text.replace(',', '')
   RowDict["Open"] = Values[1].text.replace(',', '')
   RowDict["High"] = Values[2].text.replace(',', '')
   RowDict["Low"] = Values[3].text.replace(',', '')
   RowDict["Close"] = Values[4].text.replace(',', '')
   RowDict["Volume"] = Values[5].text.replace(',', '')
   RowDict["Market Cap"] = Values[6].text.replace(',', '')
   extracted_data.append(RowDict)
 except:
  print("Row Number: " + str(i))
 finally:
  # To move to the next row
  i = i + 1

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)

Sorry, I'm new to Python and web-scraping, and I hope someone can help me. It would be very much appreciated.

Answer

To extract the Bitcoin (BTC) historical data from all seven columns of the Coincodex website, you need to induce WebDriverWait for visibility_of_all_elements_located(), then use list comprehensions to build one list per column, combine those lists into a DataFrame, and finally print the values, using the following locator strategies:

Code block:

driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
# Replace the narrow no-break space (U+202F) in the cell text with a plain space
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
df = pd.DataFrame(data=list(zip(dates, opens, highs, lows, closes, volumes, marketcaps)), columns=headers)
print(df)
driver.quit()

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
    

Console output:

            Date      Open      High       Low     Close     Volume Market Cap
0   Oct 30, 2021  $ 62,225  $ 62,225  $ 60,860  $ 61,661   $ 82.73B    $ 1.16T
1   Oct 31, 2021  $ 61,856  $ 62,379  $ 60,135  $ 61,340   $ 74.91B    $ 1.15T
2   Nov 01, 2021  $ 61,290  $ 62,368  $ 59,675  $ 61,065   $ 76.19B    $ 1.16T
3   Nov 02, 2021  $ 60,939  $ 64,071  $ 60,682  $ 63,176   $ 74.05B    $ 1.18T
4   Nov 03, 2021  $ 63,167  $ 63,446  $ 61,653  $ 62,941   $ 78.08B    $ 1.18T
5   Nov 04, 2021  $ 62,907  $ 63,048  $ 60,740  $ 61,368   $ 91.06B    $ 1.17T
6   Nov 05, 2021  $ 61,419  $ 62,480  $ 60,770  $ 61,026   $ 78.06B    $ 1.16T
7   Nov 06, 2021  $ 60,959  $ 61,525  $ 60,083  $ 61,416   $ 67.75B    $ 1.15T
8   Nov 07, 2021  $ 61,454  $ 63,180  $ 61,333  $ 63,180   $ 51.66B    $ 1.17T
9   Nov 08, 2021  $ 63,278  $ 67,670  $ 63,278  $ 67,500   $ 74.25B    $ 1.24T
10  Nov 09, 2021  $ 67,511  $ 68,476  $ 66,359  $ 66,913   $ 87.83B    $ 1.27T
11  Nov 10, 2021  $ 66,929  $ 68,770  $ 63,348  $ 64,871   $ 82.52B    $ 1.26T
12  Nov 11, 2021  $ 64,934  $ 65,580  $ 64,199  $ 64,800  $ 100.84B    $ 1.22T
13  Nov 12, 2021  $ 64,774  $ 65,380  $ 62,434  $ 64,315   $ 71.88B    $ 1.21T
14  Nov 13, 2021  $ 64,174  $ 64,850  $ 63,413  $ 64,471   $ 65.34B    $ 1.21T
15  Nov 14, 2021  $ 64,385  $ 65,255  $ 63,623  $ 65,255   $ 59.25B    $ 1.22T
16  Nov 15, 2021  $ 65,500  $ 66,263  $ 63,540  $ 63,716   $ 92.91B    $ 1.23T
17  Nov 16, 2021  $ 63,610  $ 63,610  $ 58,904  $ 60,190  $ 103.18B    $ 1.15T
18  Nov 17, 2021  $ 60,111  $ 60,734  $ 58,758  $ 60,339   $ 96.57B    $ 1.13T
19  Nov 18, 2021  $ 60,348  $ 60,863  $ 56,542  $ 56,749   $ 86.65B    $ 1.11T
20  Nov 19, 2021  $ 56,960  $ 58,289  $ 55,653  $ 58,047   $ 98.57B    $ 1.08T
21  Nov 20, 2021  $ 58,069  $ 59,815  $ 57,486  $ 59,815   $ 61.67B    $ 1.11T
22  Nov 21, 2021  $ 59,670  $ 59,845  $ 58,545  $ 58,681   $ 54.40B    $ 1.12T
23  Nov 22, 2021  $ 58,712  $ 59,061  $ 55,689  $ 56,370   $ 64.89B    $ 1.08T
24  Nov 23, 2021  $ 56,258  $ 57,832  $ 55,778  $ 57,673   $ 80.27B    $ 1.07T
25  Nov 24, 2021  $ 57,531  $ 57,694  $ 55,970  $ 57,103   $ 92.08B    $ 1.07T
26  Nov 25, 2021  $ 57,193  $ 59,333  $ 57,011  $ 58,907   $ 85.14B    $ 1.10T
27  Nov 26, 2021  $ 58,914  $ 59,120  $ 53,660  $ 53,664   $ 90.87B    $ 1.05T
28  Nov 27, 2021  $ 53,559  $ 55,204  $ 53,559  $ 54,487   $ 85.68B    $ 1.03T
29  Nov 28, 2021  $ 54,819  $ 57,315  $ 53,630  $ 57,159   $ 72.40B    $ 1.03T
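
The code above only reads the rows that are currently rendered, i.e. the first page of the historical-data table. For step 1 of the question (data from the subsequent pages), one possible approach is to click the table's paging control with Selenium and re-read the columns after each click. This is only a sketch: it assumes driver is the WebDriver from the code block above, and the CSS selector for the paging control is a guess that has to be checked against the actual Coincodex markup:

import time

# Hypothetical selector for the table's "next page" control -- verify it
# against the live page before relying on it
PAGER_SELECTOR = "app-pagination button"

all_dates = []
for _ in range(5):  # the number of pages to walk is arbitrary in this sketch
    # Re-read the first column; the other columns work the same way as above
    all_dates.extend(my_elem.text for my_elem in WebDriverWait(driver, 20).until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)"))))
    # Advance to the next page and give the table a moment to re-render
    WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, PAGER_SELECTOR))).click()
    time.sleep(2)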

