Python - Selenium - webscrape table with text in html using WebDriverWait

Problem description

I am trying to scrape the names of all companies with 500 or more employees from the following website:

https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1

I wrote a script that scrapes the company names from the first page, then clicks the "Next Site" button and scrapes the names again. The names are saved to a list, and this should repeat until the list contains a certain number of names. The list is then converted to a DataFrame and exported to an Excel file (.xlsx). Unfortunately, it does not do this at the moment. Here is the code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

company_list = []

driver = webdriver.Chrome('/Users/rieder/Anaconda3/chromedriver_win32/chromedriver.exe')

driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1')

driver.find_element_by_id("cookiesNotificationConfirm").click();

while len(company_list) < 20:

    company_name = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='zebraTable zebraTable--companies']//following::tr[2]/td[@class='zebraTable__td zebraTable__td--companyName']/a"))).get_attribute("innerHTML")
    
    for p in range(len(company_name)):
        company_list.append(company_name)
        
    driver.find_element_by_xpath("//*[@id='content']/section[3]/div/div/form/div/div[2]/div[2]/div[2]/div/button[2]").click();
              
    print(company_list)

    df = pd.DataFrame(company_list,columns =['Unternehmensname']) 

    df.to_excel("output.xlsx")  
            
    time.sleep(5)

My output looks like this:

['\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk 
Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n          
              Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ']

I think it's because .get_attribute() only gets one attribute, but I don't know how to get all the attributes at this point.

Thanks in advance.

Solution

Yes, with .get_attribute() you can only get one attribute at a time. To get all attributes, you can use the JavaScript code below:

driver.execute_script('var items = {}; for (index = 0; index < arguments[0].attributes.length; ++index) { items[arguments[0].attributes[index].name] = arguments[0].attributes[index].value }; return items;', ele)

Here, ele is your WebElement.
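If you need this in more than one place, the JavaScript above can be wrapped in a small helper. This is only a sketch: the names GET_ATTRIBUTES_JS and get_all_attributes are made up for illustration, and the only Selenium call it relies on is driver.execute_script(script, *args), which passes the element to the script as arguments[0].

```python
# JavaScript from the answer, split across lines for readability.
# GET_ATTRIBUTES_JS and get_all_attributes are hypothetical names.
GET_ATTRIBUTES_JS = (
    "var items = {}; "
    "for (var index = 0; index < arguments[0].attributes.length; ++index) { "
    "items[arguments[0].attributes[index].name] = "
    "arguments[0].attributes[index].value }; "
    "return items;"
)


def get_all_attributes(driver, element):
    """Return every HTML attribute of `element` as a plain dict.

    `driver` is any object with an execute_script(script, *args) method,
    such as a Selenium WebDriver; `element` becomes arguments[0] in the script.
    """
    return dict(driver.execute_script(GET_ATTRIBUTES_JS, element))
```

With a real driver you would call it as get_all_attributes(driver, ele) and get back something like a mapping of attribute names ("class", "href", ...) to their values.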

To print all the company names, you can use the approach below:

company_names = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td[@class='zebraTable__td zebraTable__td--companyName']")))
for cn in company_names:
    print(cn.text)
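As a side note on the output in the question: the repetition does not seem to come from get_attribute() itself. visibility_of_element_located returns a single element, so company_name is one string, and the inner loop then appends that same string once per character in it. The sketch below reproduces this without a browser; raw_html is a made-up stand-in for the innerHTML value.

```python
# Stand-in for the innerHTML string returned for a single <a> element.
raw_html = "\n    Progress-Werk Oberkirch AG\n    "

# Same pattern as the loop in the question: len() of a string counts
# characters, so the identical string is appended once per character.
repeated = []
for _ in range(len(raw_html)):
    repeated.append(raw_html)

print(len(repeated) == len(raw_html))  # True: one entry per character

# strip() removes the surrounding newlines and spaces from the name.
cleaned = raw_html.strip()
print(cleaned)
```

Using cn.text, as in the snippet above, avoids most of this whitespace because Selenium returns the rendered text of the element.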

Note: The WebDriverWait snippet with visibility_of_all_elements_located prints all the company names on the first page only. If you want the names from every page, you need to click the next-page icon on each page and run that snippet again in a loop.
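The paging loop described in the note could look roughly like the sketch below. It is an assumption-laden outline, not a tested scraper for this site: collect_names and selenium_callbacks are hypothetical names, the next-button XPath is copied from the question and may no longer match the page, and find_element(By.XPATH, ...) assumes a Selenium version that provides it (older versions use find_element_by_xpath as in the question).

```python
NAME_XPATH = "//td[@class='zebraTable__td zebraTable__td--companyName']"
# Copied from the question; assumed to still point at the next-page button.
NEXT_XPATH = ("//*[@id='content']/section[3]/div/div/form/div/div[2]"
              "/div[2]/div[2]/div/button[2]")


def collect_names(read_page, go_next, limit):
    """Gather cleaned names page by page until `limit` names are collected.

    read_page() returns the raw name strings of the current page;
    go_next() moves to the next page and returns False when there is none.
    """
    names = []
    while len(names) < limit:
        stripped = [text.strip() for text in read_page()]
        names.extend(name for name in stripped if name)  # skip empty cells
        if len(names) >= limit or not go_next():
            break
    return names[:limit]


def selenium_callbacks(driver, timeout=20):
    """Build read_page/go_next callbacks for a Selenium driver (sketch)."""
    # Imported lazily so collect_names can be tried without Selenium installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def read_page():
        cells = WebDriverWait(driver, timeout).until(
            EC.visibility_of_all_elements_located((By.XPATH, NAME_XPATH)))
        return [cell.text for cell in cells]

    def go_next():
        try:
            driver.find_element(By.XPATH, NEXT_XPATH).click()
        except NoSuchElementException:
            return False  # no next-page button found
        return True

    return read_page, go_next
```

You would then call names = collect_names(*selenium_callbacks(driver), limit=20) and pass the result to pd.DataFrame(names, columns=['Unternehmensname']).to_excel("output.xlsx") once, after the loop, as in the question. Depending on how the page reloads, you may also need to wait for the old table to go stale after the click before reading names again.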
