Extracting table data into a pandas DataFrame using Selenium and Python
Question
I have extracted data from a table using the BeautifulSoup library with the code below:
if soup.find("table", {"class":"a-keyvalue prodDetTable"}) is not None:
    table = parse_table(soup.find("table", {"class":"a-keyvalue prodDetTable"}))
    df = pd.DataFrame(table)
This worked: I get the table and parse it into a DataFrame. However, I am trying to do something similar on a different website using Selenium, and here is my code so far:
driver = webdriver.Chrome()
i = "DCD710S2"
base_url = str("https://www.lowes.com/search?searchTerm=" + str(i))
driver.get(base_url)
table = driver.find_element_by_xpath("//*[@id='collapseSpecs']/div/div/div[1]/table/tbody")
So I can locate the table, and I tried get_attribute("innerHTML") and a few other attributes, but I am unable to get the table as-is into pandas. Any suggestions on how to handle this with Selenium?
Here is how the HTML looks:
Recommended answer
Use pandas to fetch the tables. Try the following code.
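import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
i = "DCD710S2"
base_url = "https://www.lowes.com/search?searchTerm=" + i
driver.get(base_url)
time.sleep(3)  # give the page time to render the specifications
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div#collapseSpecs")
# read_html returns one DataFrame per <table> found inside the div;
# wrapping the markup in StringIO avoids the literal-HTML deprecation warning in newer pandas
table = pd.read_html(StringIO(str(div)))
print(table[0])
print(table[1])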
Output:
    0                                    1
0   Battery Amp Hours                    1.3
1   Tool Power Output                    189 UWO
2   Side Handle Included                 No
3   Number of Clutch Settings            15
4   Case Type                            Soft
5   Series Name                          NaN
6   Tool Weight (lbs.)                   2.2
7   Tool Length (Inches)                 7.5
8   Tool Width (Inches)                  2.0
9   Tool Height (Inches)                 7.75
10  Forward and Reverse Switch Included  Yes
11  Sub-Brand                            NaN
12  Battery Type                         Lithium ion (Li-ion)
13  Battery Voltage                      12-volt max
14  Charger Included                     Yes
15  Variable Speed                       Yes

    0                                 1
0   Maximum Chuck Size                3/8-in
1   Number of Batteries Included      2
2   Battery Warranty                  3-year limited
3   Maximum Speed (RPM)               1500.0
4   Bluetooth Compatibility           No
5   Charge Time (Minutes)             40
6   App Compatibility                 No
7   Works with iOS                    No
8   Brushless                         No
9   CA Residents: Prop 65 Warning(s)  Yes
10  Tool Warranty                     3-year limited
11  UNSPSC                            27112700
12  Works with Android                No
13  Battery Included                  Yes
14  Right Angle                       No
15  Wi-Fi Compatibility               No
If you want a single DataFrame, try this.
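import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
i = "DCD710S2"
base_url = "https://www.lowes.com/search?searchTerm=" + i
driver.get(base_url)
time.sleep(3)  # give the page time to render the specifications
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div#collapseSpecs")
# wrapping the markup in StringIO avoids the literal-HTML deprecation warning in newer pandas
table = pd.read_html(StringIO(str(div)))
frames = [table[0], table[1]]
# stack the two tables vertically and renumber the rows from 0
result = pd.concat(frames, ignore_index=True)
print(result)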
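To see what ignore_index=True does in isolation, here is a minimal sketch that runs without a browser. The two small frames are hypothetical stand-ins for table[0] and table[1], filled with a few values from the output above.

```python
import pandas as pd

# Hypothetical stand-ins for table[0] and table[1] (values borrowed from the output above).
first = pd.DataFrame([["Battery Amp Hours", "1.3"], ["Case Type", "Soft"]])
second = pd.DataFrame([["Maximum Chuck Size", "3/8-in"], ["Brushless", "No"]])

# Without ignore_index, each frame keeps its own row labels, so labels repeat.
stacked = pd.concat([first, second])

# With ignore_index=True, the combined frame gets a fresh 0..n-1 index.
result = pd.concat([first, second], ignore_index=True)

print(list(stacked.index))  # [0, 1, 0, 1]
print(list(result.index))   # [0, 1, 2, 3]
```

Repeated labels are not an error, but label-based lookups such as stacked.loc[0] would then return two rows, which is usually not what you want for a single specification table.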
A Selenium-only option that builds the pandas DataFrame directly.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

spec_name = []
spec_item = []
driver = webdriver.Chrome()
i = "DCD710S2"
base_url = "https://www.lowes.com/search?searchTerm=" + i
driver.get(base_url)
# wait up to 20 seconds for the specification tables to be present in the DOM
tables = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='collapseSpecs']//table")))
for table in tables:
    for row in table.find_elements(By.XPATH, ".//tr"):
        # textContent also returns the text of elements that are not currently visible
        spec_name.append(row.find_element(By.XPATH, "./th").get_attribute("textContent"))
        spec_item.append(row.find_element(By.XPATH, "./td/span").get_attribute("textContent"))
df = pd.DataFrame({"Spec_Name": spec_name, "Spec_Title": spec_item})
print(df)
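Since the question mentions trying innerHTML, another route is to hand the table element's outerHTML straight to pandas. pd.read_html only picks up markup wrapped in a table tag, which is likely why the innerHTML of a tbody did not parse. The sketch below is an illustration under assumptions rather than a tested scraper: the helper name table_html_to_frame and the sample markup are hypothetical, and it assumes an HTML parser such as lxml is installed for pandas to use.

```python
from io import StringIO

import pandas as pd


def table_html_to_frame(table_html: str) -> pd.DataFrame:
    """Parse the outerHTML of a single <table> element into a DataFrame."""
    # read_html returns a list with one DataFrame per <table> it finds.
    frames = pd.read_html(StringIO(table_html))
    return frames[0]


# Hypothetical sample shaped like the rows in the answer above (th plus td/span).
sample_html = """
<table>
  <tbody>
    <tr><th>Battery Amp Hours</th><td><span>1.3</span></td></tr>
    <tr><th>Case Type</th><td><span>Soft</span></td></tr>
  </tbody>
</table>
"""

df = table_html_to_frame(sample_html)
print(df)
```

With a live driver, the string would come from something like driver.find_element(By.XPATH, "//div[@id='collapseSpecs']//table").get_attribute("outerHTML"), locating the table itself rather than its tbody.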