Selenium Loop append multiple tables together


Problem description


I am a new Python user here. I have been writing code that uses Selenium and Beautiful Soup to go to a website, get the HTML table, and turn it into a DataFrame.


I am using Selenium to loop through a number of different pages and Beautiful Soup to collect the table from each one.


The issue I am running into is that I can't get all those tables to append to each other. If I print the DataFrame, it only prints the last table that was scraped. How do I tell Beautiful Soup to append one DataFrame to the bottom of the other?


Any help would be greatly appreciated; I've been stuck on this one little part for a couple of days.
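The append pattern being asked about can be sketched with stand-in frames (the county/payment values below are illustrative, not scraped): collect each page's DataFrame in a list, then call pd.concat once after the loop instead of overwriting df each iteration.

```python
import pandas as pd

# Stand-ins for one scraped table per state (dummy data, not from the site).
pages = [
    pd.DataFrame({"COUNTY": ["AUTAUGA COUNTY"], "PAYMENT": ["$4,971"]}),
    pd.DataFrame({"COUNTY": ["ALEUTIANS EAST BOROUGH"], "PAYMENT": ["$668,816"]}),
]

frames = []
for page_df in pages:
    frames.append(page_df)          # accumulate instead of reassigning df

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating once at the end is also faster than growing a DataFrame row by row inside the loop.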

states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", 
"Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire",
"New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", 
"Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", 
"Washington", "West Virginia", "Wisconsin", "Wyoming"]

period = "2020"

num_states = len(states)

state_list = []

for state in states:
    driver = webdriver.Chrome(executable_path = 'C:/webdrivers/chromedriver.exe')
    driver.get('https://www.nbc.gov/pilt/counties.cfm')
    driver.implicitly_wait(20)
    state_s = driver.find_element(By.NAME, 'state_code')
    drp = Select(state_s)
    drp.select_by_visible_text(state)
    year_s = driver.find_element(By.NAME, 'fiscal_yr')
    drp = Select(year_s)
    drp.select_by_visible_text(period)
    driver.implicitly_wait(10)
    link = driver.find_element(By.NAME, 'Search')
    link.click()
    url = driver.current_url
    page = requests.get(url)
    #dfs  = pd.read_html(addrss)[2]
    # Get the html
    soup = BeautifulSoup(page.text, 'lxml')
    table = soup.findAll('table')[2]
    headers = []

    for i in table.find_all('th'):
        title = i.text.strip()
        headers.append(title)

    df = pd.DataFrame(columns = headers)

    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    df = pd.DataFrame.rename(columns={'Total Acres':'Total_acres'})
    for i in range(s,num_states):
        state_list.append([County[i].text, Payment[i].text, Total_acres[i].text])

print(df)


******************** EDIT ***********************

period = "2020"

num_states = len(states)

state_list = []

df = pd.DataFrame()


for state in states:
    driver = webdriver.Chrome(executable_path = 'C:/webdrivers/chromedriver.exe')
    driver.get('https://www.nbc.gov/pilt/counties.cfm')
    driver.implicitly_wait(20)
    state_s = driver.find_element(By.NAME, 'state_code')
    drp = Select(state_s)
    drp.select_by_visible_text(state)
    year_s = driver.find_element(By.NAME, 'fiscal_yr')
    drp = Select(year_s)
    drp.select_by_visible_text(period)
    driver.implicitly_wait(10)
    link = driver.find_element(By.NAME, 'Search')
    link.click()
    url = driver.current_url
    page = requests.get(url)
    #dfs = pd.read_html(addrss)[2]
    # Get the html
    soup = BeautifulSoup(page.text, 'lxml')
    table = soup.findAll('table')[2]
    headers = []

    for i in table.find_all('th'):
        title = i.text.strip()
        headers.append(title)

    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data


dfs = pd.concat([df for state in states])

print(df)


Results in: ValueError: cannot set a frame with no defined columns
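The traceback comes from the `df = pd.DataFrame()` line in the edit: a DataFrame created with no columns cannot take row assignments via `.loc`, because there is nothing to align the values against. A minimal reproduction (row values are illustrative):

```python
import pandas as pd

# Reproduces the error: no columns defined, so .loc row assignment fails.
df = pd.DataFrame()
try:
    df.loc[0] = ["AUTAUGA COUNTY", "$4,971", "1,758"]
except ValueError as e:
    print(e)  # cannot set a frame with no defined columns

# Defining the columns first (as the original per-state loop did) works:
df = pd.DataFrame(columns=["COUNTY", "PAYMENT", "TOTAL ACRES"])
df.loc[len(df)] = ["AUTAUGA COUNTY", "$4,971", "1,758"]
print(df)
```

This is why the first version of the code, which built headers from the `<th>` tags before creating the DataFrame, did not raise.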

Answer


Access the table through pandas instead. Please refer to the comments against the lines that have been added.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

states = ["Alabama", "Alaska"]

period = "2020"

num_states = len(states)

state_list = []
driver = webdriver.Chrome()
result = []  # change 1: list to store the {state: df} pairs
for state in states:

    driver.get('https://www.nbc.gov/pilt/counties.cfm')
    driver.implicitly_wait(20)
    state_s = driver.find_element(By.NAME, 'state_code')
    drp = Select(state_s)
    drp.select_by_visible_text(state)
    year_s = driver.find_element(By.NAME, 'fiscal_yr')
    drp = Select(year_s)
    drp.select_by_visible_text(period)
    driver.implicitly_wait(10)
    link = driver.find_element(By.NAME, 'Search')
    link.click()
    temp_res = {}
    soup = BeautifulSoup(driver.page_source, 'lxml')
    df_list = pd.read_html(soup.prettify(), thousands=',')  # access the tables through pandas
    try:
        df_list[2].drop('PAYMENT.1', axis=1, inplace=True)  # some states include this column, so delete it
    except KeyError:
        print(f"state: {state} does not have PAYMENT.1")
    try:
        df_list[2].drop('PAYMENT.2', axis=1, inplace=True)  # some states include this column, so delete it
    except KeyError:
        print(f"state: {state} does not have PAYMENT.2")
    temp_res[state] = df_list[2]  # the table at index 2
    result.append(temp_res)

The output looks like:

for each_run in result :
    for each_state in each_run:
        print(each_run[each_state].head(1))
 COUNTY PAYMENT TOTAL ACRES
0  AUTAUGA COUNTY  $4,971       1,758
                   COUNTY   PAYMENT TOTAL ACRES
0  ALEUTIANS EAST BOROUGH  $668,816   2,663,160
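If a single DataFrame is wanted at the end, the `result` list of {state: df} dicts can be flattened with pd.concat. The frames below are small stand-ins mirroring the structure of `result`, not actual scraped data:

```python
import pandas as pd

# Hypothetical sample with the same shape as the answer's `result` list.
result = [
    {"Alabama": pd.DataFrame({"COUNTY": ["AUTAUGA COUNTY"], "PAYMENT": ["$4,971"]})},
    {"Alaska": pd.DataFrame({"COUNTY": ["ALEUTIANS EAST BOROUGH"], "PAYMENT": ["$668,816"]})},
]

# Flatten into one frame, keeping the state as a column.
frames = []
for each_run in result:
    for state, df in each_run.items():
        frames.append(df.assign(STATE=state))
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Tagging each frame with its state before concatenating preserves which rows came from which page, which the plain loop-and-print in the answer does not.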

